1 Introduction

In recent years, location-based social networks (LBSNs) such as Foursquare, Gowalla and Yelp have grown rapidly with the widespread adoption of mobile devices. The rich check-in data they accumulate provides a valuable source for discovering users’ preferences and delivering personalized recommendation services. Unlike traditional POI recommendation, next POI recommendation not only analyzes users’ sequential behavior patterns in check-in sequences but also considers their spatiotemporal context. This makes it highly timely and practical, with a wide range of applications, as shown in Fig. 1.

Fig. 1 Services of next point-of-interest

Previous works [1,2,3,4] have surveyed various aspects of POI recommendation. Because timeliness matters, most recent POI recommendation methods are based on sequence models. Early studies [5, 6] combined traditional POI recommendation methods (e.g., collaborative filtering, matrix factorization) with Markov chains to address their neglect of users’ recent preferences. Recurrent neural network (RNN) models and their variants have become essential for modeling user check-in sequences, and works based on both GRU [7] and LSTM [8] have demonstrated good performance. Some works have also introduced graph neural networks to model spatial data and enrich contextual information. Additionally, the emergence of attention mechanisms in natural language processing has provided a promising option for sequence models.

However, several challenges remain. Although datasets contain a large number of check-in records, user preferences are relatively fixed, so interactions between users and POIs are extremely sparse. Moreover, sequential models learn each user’s preferences in isolation and fail to capture interactive features between different users, which makes accurate recommendation difficult for users with limited check-in records. Furthermore, some research that utilizes graph neural networks (GNNs) [9,10,11] has not been adequately adapted to real-world scenarios: the entries of the adjacency matrix often encode only the existence of a relationship rather than its numerical strength, which makes all neighbors appear equidistant.

We propose the Social and Spatio-Temporal aware next-POI (SSTP) recommendation model to address these challenges. To mitigate data sparsity, we enrich the representation of sequence information by incorporating multiple modalities of POI-related information into the check-in sequence. Additionally, we design a two-layer sequence feature extractor that uses a self-attention module and a GRU module to explore users’ long- and short-term preferences. To address the user cold-start problem, we propose a user relationship-building algorithm that creates social network graphs and uses random neighbor sampling to reveal the preference information of user neighbors. Finally, we propose a geographical-aware graph attention network that considers distance similarity when computing node similarity between different POIs. This establishes a fine-grained distance proximity relationship between POIs and their neighbors, rather than a coarse-grained neighborhood relationship.

The main contributions of this work are:

  • We propose a Social and Spatio-Temporal aware model, SSTP, which models user preferences over POIs efficiently by fully considering multiple factors. In particular, a geographical-aware graph attention network is devised to transform the coarse-grained relationships between POIs into fine-grained ones.

  • To address the user cold start problem, we develop a method for discovering group preference features from user social networks using random neighborhood sampling.

  • We fuse the augmentation of sequence information, the construction of high-dimensional structure graphs and the integration of contextual semantics within a unified framework to mitigate data sparsity.

  • Extensive experiments are conducted on two real-world datasets. The results show that our model outperforms the current state-of-the-art baseline methods.

The rest of the paper is organized as follows. Section 2 reviews related work on POI recommendation. Section 3 presents definitions and preliminary knowledge for POI recommendation. Section 4 details the proposed SSTP model. Section 5 presents and analyzes the experimental results. Finally, we conclude the paper and discuss future work in Sect. 6.

2 Related Work

In this section, we review recent research on POI recommendation.

2.1 Traditional POI Recommendation

Traditional POI recommendation methods are primarily based on the user-POI interaction matrix, with collaborative filtering (CF) [12] and matrix factorization (MF) [13] being the mainstream approaches. These methods also consider the influence of geographical and temporal factors. PACE [14] is a semi-supervised deep learning framework based on CF that mitigates data sparsity by jointly predicting users’ preferences and contexts. Other work makes recommendations by analyzing trust between users combined with spatiotemporal factors [15]. Rank-GeoFM [16] integrates temporal and spatial information into user check-in sequences and scores POIs via matrix factorization. Geo-Teaser [17] combines a temporal POI embedding model with geography-aware preference ranking to predict POIs.

However, most traditional approaches still rely on static statistics over check-in data, which limits how effectively models can be trained.

2.2 Sequential Recommendation

Sequential recommendation aims to predict the future state of a sequence by analyzing the features of the sequence itself. Early studies mainly used Markov chain (MC) [18] models. FPMC [5] combines Markov chains and matrix factorization to learn a transition matrix for each user and introduces Bayesian personalized ranking for prediction. However, extreme data sparsity makes the Markov model nearly unusable. With the rise of deep learning, convolutional neural network (CNN) [19] and RNN [20] models have been applied to sequential recommendation. Caser [21] embeds sequence information into a temporal latent space to form a two-dimensional “image,” applies convolutional filters to capture local features and aggregates them with max and average pooling. ST-RNN [22] attends to different time intervals with time-specific transformation matrices and to different geographical distances with distance-specific transformation matrices, modeling the local temporal and spatial contexts at each layer of the RNN.

The primary limitation of these sequential recommenders is their lack of spatiotemporal context, which is the central concern of next POI recommendation.

2.3 Next POI Recommendation

Next POI recommendation takes into account the current spatiotemporal contextual information along with the user’s recent behavior in the check-in sequence. STGN [23] models personalized sequence patterns for both the long-term and short-term aspects of the user’s preferences by adding time and distance gates. SASRec [24] is utilized to analyze which locations are associated with the user’s historical check-in sequence in each time segment and uses these associations to predict the next location. STAN [25] uses a two-layer attention architecture to aggregate spatiotemporal correlations within a user’s check-in trajectory, analyzing non-adjacent locations and non-sequential visits. The emergence of graph embedding and graph neural network techniques has added new ways to model graph-structured data in LBSNs. SGRec [26] models POI relationships in sequences as a graph to overcome the sparsity problem, using a multitask approach to obtain denser category transition relationships. To improve the expressivity of representations, DRAN [27] leverages a disentangled graph convolution network (DGCN) to explicitly model different aspects of POIs when learning POI representations. GSTN [28] incorporates user spatial and temporal dependencies for next POI recommendation, using a long short-term memory (LSTM) network to model user-specific temporal dependencies and GSD to learn user spatial dependencies.

In contrast to existing work, we analyze users’ multi-level preferences based on statistics of the data. Additionally, we propose an adaptation method that builds graph data for effective modeling, mitigating the impact of data sparsity.

3 Preliminaries

In this section, we first introduce the notations and definitions (see Table 1 for details), then perform analysis and make assumptions based on real-world datasets, and finally discuss the factors involved in the recommendation process.

Table 1 Notation used in this paper

3.1 Definitions and Notations

We denote the set of users, POIs, time and categories as \({\mathcal {U}} = \{u_{1}, u_{2}, \dots , u_{N} \}\), \({\mathcal {L}} = \{l_{1}, l_{2}, \dots , l_{M} \}\), \({\mathcal {T}} = \{t_{1}, t_{2}, \dots , t_{T} \}\), \({\mathcal {C}} = \{c_{1}, c_{2}, \dots , c_{K} \}\), respectively.

Definition 1

(POI) A POI is a unique geographical location, and it can be represented by a 2-tuple coordinate: \(\langle latitude, longitude \rangle\).

Definition 2

(Time Slot) The granularity of time stamps is too fine to capture the temporal behavior patterns of users. Therefore, time stamps are divided into time slots, with each time slot containing multiple time stamps.

Definition 3

(Check-in) A check-in is a record which can be defined as 4-tuple: \(p^{u_{i}}_{t} = \langle u_{i}, l_{j}, t, c_{k} \rangle\), and it denotes that user \(u_{i}\) has visited POI \(l_{j}\) of category \(c_{k}\) at time slot t.

Definition 4

(Check-in Sequence) A check-in sequence can be represented by a series of check-in records ordered by time stamp, and the check-in sequence of user \(u_{i}\) is defined as \(r_{u_{i}} = \{p^{u_{i}}_{t_{1}}, p^{u_{i}}_{t_{2}}, \dots , p^{u_i}_{t_{\tau }} \}\). In addition, we define the set of check-in sequences for all users as \({\mathcal {R}} = \{r_{u_{1}}, r_{u_{2}}, \dots ,r_{u_{N}} \}\).

Definition 5

(POI Adjacent Graph) If the distance between two POIs is less than a certain distance threshold, then the two POIs can be considered neighbors of each other. It can be defined as \({\mathcal {G}}_{g} = \langle {\mathcal {V}}_{g}, {\mathcal {E}}_{g} \rangle\), where \({\mathcal {V}}_{g}\) and \({\mathcal {E}}_{g}\) are the set of nodes and edges, respectively.

Definition 6

(User Social Network) If the similarity between the check-in record sets of two users, \(sim = \frac{|r_{u_{i}} \cap r_{u_{j}}|}{min(|r_{u_{i}}|, |r_{u_{j}}|)}\), exceeds a threshold, there is a relationship edge between the two users. The network can be represented by \({\mathcal {G}}_{u} = \langle {\mathcal {V}}_{u}, {\mathcal {E}}_{u} \rangle\), where \({\mathcal {V}}_{u}\) and \({\mathcal {E}}_{u}\) are the set of nodes and edges, respectively.

Definition 7

(Top-K POI Recommendation) Given the user check-in sequence \(r_{u_{i}}\), the POI candidates \({\mathcal {L}}_{c} = \{ l_{1}, l_{2}, \dots ,l_{M} \}\) and the spatiotemporal contextual information, the model scores each candidate POI and selects the Top-k POIs with the highest scores as the recommendation set.
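To make Definitions 5 and 6 concrete, the following minimal sketch shows one way the two graphs could be constructed. The haversine distance, the 3 km adjacency threshold and the 0.4 similarity threshold follow Sects. 3.1 and 5.1.4, but the function names and the set-based reading of \(r_{u_{i}} \cap r_{u_{j}}\) are illustrative assumptions, not the authors’ implementation.

import math
from itertools import combinations

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two coordinates, in kilometers.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def build_poi_graph(coords, threshold_km=3.0):
    # Definition 5: two POIs become neighbors if their distance is below the threshold.
    # coords: list of (latitude, longitude) tuples indexed by POI id.
    return {(i, j) for i, j in combinations(range(len(coords)), 2)
            if haversine_km(*coords[i], *coords[j]) < threshold_km}

def build_social_graph(visited, sim_threshold=0.4):
    # Definition 6: an edge exists if the overlap of two users' visited-POI sets,
    # normalized by the smaller set, exceeds the similarity threshold.
    # visited: dict mapping user id -> set of visited POI ids (our assumption).
    edges = set()
    for u, v in combinations(list(visited), 2):
        sim = len(visited[u] & visited[v]) / min(len(visited[u]), len(visited[v]))
        if sim > sim_threshold:
            edges.add((u, v))
    return edges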

3.2 Factors Analysis and Intuition

3.2.1 Interest Preference

In recommendation services, the user’s check-in behavior preferences are crucial, and the check-in sequence contains rich information about them. However, as check-in records accumulate over time, the preference features within the sequence become more complex. Figure 2 displays the time interval between the earliest and latest visits to the same POI by users in both the NYC and TKY datasets. Users clearly revisit POIs most often within 30 days, but the proportion of POIs revisited beyond 30 days is still high. Therefore, we separate preferences into long-term and short-term aspects: long-term preferences are stable and change little over time, while short-term preferences reflect the user’s recent habits. The two aspects combine to form the overall user preference.

Fig. 2 Statistics of time interval in the same POI. The X-axis indicates the number of interval days, and the Y-axis indicates the number of POIs

3.2.2 Time Period

There are always periodic patterns in people’s daily lives, and individuals have different behavioral traits at different times. Figure 3 shows that although users behave differently, they tend to visit the same POI in a close time period.

Fig. 3 Check-in time of different users at the same POI. The X-axis represents the check-in sequence, and the Y-axis represents the time of check-in

Figure 4 shows the visit activity of users in the two datasets within one week. It is easy to see that users’ check-in activity follows a near-daily cycle, reflecting the periodicity of user behavior.

Fig. 4 User activity in two real-world datasets. The activity of users is cyclical during the week

3.2.3 Geographical Distance

In LBSN, users are limited by the physical distance between locations in the real environment, making it challenging to travel to two locations with a significant distance interval in a short time. The distance between a user’s current POI location and the next POI location can impact their final choice, as depicted in Fig. 5.

Fig. 5 User sensitivity to distance

It is evident that users tend to choose locations with shorter distances between successive POI check-ins, with the number of users decreasing rapidly as the distance increases. Figure 5b illustrates the mean distance of POIs across all check-in records for different users, indicating that users’ activity regions are typically constrained. Thus, distance is a significant factor in determining a user’s next POI selection.

3.2.4 Location Category

In POI recommendation systems, the number of POIs is typically much larger than the number of POI categories. As a result, the user-category interaction matrix is typically denser than the user-POI interaction matrix, as depicted in Fig. 6. This high density between users and POI categories presents an opportunity to address the data sparsity problem in POI recommendation.

Fig. 6 Confusion matrix. There are significantly more user-category interactions than user-POI interactions

3.2.5 Social Influence

As a subset of social networks, LBSNs also contain networks of friends among users. The exchange of information among friends also provides an opportunity for the propagation of user interest preferences, which enables the spread of preference features in the network. This approach is similar to the idea of collaborative filtering.

Past research on POI recommendation shows that reasonable hypotheses can effectively guide the analysis of user check-in data and spatiotemporal context, which facilitates the targeted design of the model. Based on the above analysis of influencing factors and the dataset study, this paper proposes the following intuitions:

Intuition 1. Users’ check-in records do not reflect a single preference but are rich in hierarchy, and user check-in sequences can be better analyzed through time segmentation.

Intuition 2. Users’ behavior is temporally cyclical, i.e., users tend to behave similarly at similar times of day, such as going to work and going home. At the same time, check-in activity varies across time periods.

Intuition 3. Influenced by geographical distance, users tend to choose a closer POI as the next check-in location over a more distant one, and this tendency becomes more pronounced as distance increases.

Intuition 4. Users’ interest preferences can spread within their social circle in an LBSN, and users can also acquire new interest preferences from their social circle.

4 Approach

In this section, we introduce the proposed SSTP model, beginning with an overview of its architecture and then detailing each module. Our model consists of the following modules: (1) a multimodal feature embedding module that learns representations of users, POIs, time and categories; (2) a user preference tracking module that captures the features of users’ check-in behavior; (3) a geographical awareness module that discovers the distance relationships between POIs; and (4) a social influence module that learns the effects among users. Figure 7 shows the architecture of SSTP.

To clarify, the model is divided into four parts, separated by three dashed lines in the figure, each focusing on one of four aspects: long-term preference, short-term preference, geographical location and social relationship. The long-term preference module aims to explore the user’s entire sequence of actions, while the short-term preference module focuses on their recent check-in behavior. The geolocation module analyzes the distance between different locations and leverages a custom GAT to extract the geospatial features of the current location. The social relationship module examines the association of preferences between users and utilizes random sampling to enhance their features. Finally, the outputs of all four modules are aggregated to generate the final feature vector used for recommendation.

Fig. 7 Overall architecture of the SSTP. It contains five modules: user preference tracking, geographical awareness, social influence, multimodal feature embedding and candidate set prediction. \(r^{l}_{u_{i}}\), \(r^{s}_{u_{i}}\), \({\mathcal {G}}_{g}\) and \({\mathcal {G}}_{u}\) are the inputs to the model; after processing by each module, we obtain the corresponding feature representations \(P^{l}_{u_{i}}\), \(P^{s}_{u_{i}}\), \(H_{u_{i}}\) and \(E_{u_{i}}\). Finally we sum these features to get the final feature \(F_{u_{i}}\)

4.1 Multimodal Feature Embedding

The aim of this module is to transform numerical IDs, including the user ID, POI ID, time ID and category ID, into richer representations, and then to construct advanced data structures such as the POI adjacency graph and the user social network from them. A check-in sequence of a given length n can be defined as \(r_{u_{i}} = \{ p^{u_{i}}_{t_{1}}, p^{u_{i}}_{t_{2}}, \dots , p^{u_{i}}_{t_{n}} \} = \{ \left\langle u_{i}, l_{1}, t_{1}, c_{1} \right\rangle , \dots , \left\langle u_{i}, l_{j}, t_{n}, c_{k} \right\rangle \}\), as shown in Fig. 8. Following the idea of word2vec [29], we map this multimodal scalar ID information into low-dimensional dense feature vectors through an embedding layer and denote the representations of user, POI, time and category as \(e_{u_{i}} \in {\mathcal {R}}^{d}\), \(e_{l_{j}} \in {\mathcal {R}}^{d}\), \(e_{t_{\tau }} \in {\mathcal {R}}^{d}\) and \(e_{c_{k}} \in {\mathcal {R}}^{d}\), respectively.

Fig. 8 Detail of check-in sequence. Each check-in record contains a user ID, POI ID, time stamp and category ID

$$\begin{aligned} {\left\{ \begin{array}{ll} e_{u_{i}} = \textrm{Embedding}_\textrm{user}(u_{i}) \\ e_{l_{j}} = \textrm{Embedding}_\textrm{poi}(l_{j}) \\ e_{t_{\tau }} = \textrm{Embedding}_\textrm{time}(t_{\tau }) \\ e_{c_{k}} = \textrm{Embedding}_\textrm{category}(c_{k}) \\ \end{array}\right. } \end{aligned}$$
(1)

Embedding\((\cdot )\) denotes an embedding layer. We then concatenate these features to obtain a record feature vector \(e_{j} = (e_{l_{j}} \oplus e_{t_{\tau }} \oplus e_{c_{k}}) \in {\mathcal {R}}^{3*d}\), and the embedding feature of the whole sequence \(r_{u_{i}}\) is defined as \(E \in {\mathcal {R}}^{n \times 3*d}\).

$$\begin{aligned} E = \left[ e_{1}, e_{2}, \dots , e_{n} \right] \end{aligned}$$
(2)

For the POI adjacent graph and user social network, we use \(h_{j} = e_{l_{j}} \oplus e_{c_{k}} \in {\mathcal {R}}^{2*d}\) and \(e_{u_{i}} \in {\mathcal {R}}^{d}\) as the initial feature representation \({\mathcal {F}}_{g}\) and \({\mathcal {F}}_{u}\) of each node in the graph, respectively.

$$\begin{aligned} {\mathcal {F}}_{g}&= \{ h_{1}, h_{2}, \dots , h_{M}\} \end{aligned}$$
(3)
$$\begin{aligned} {\mathcal {F}}_{u}&= \{ e_{u_{1}}, e_{u_{2}}, \dots , e_{u_{N}}\} \end{aligned}$$
(4)
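Putting Eqs. (1)-(4) together, a minimal PyTorch sketch of this module might look as follows. The embedding dimension follows Sect. 5.1.4 (\(d=100\)), while the class and argument names are illustrative rather than the authors’ code.

import torch
import torch.nn as nn

class MultiModalEmbedding(nn.Module):
    # One embedding table per modality (Eq. 1); record features are concatenated (Eq. 2).
    def __init__(self, n_users, n_pois, n_slots, n_cats, d=100):
        super().__init__()
        self.user = nn.Embedding(n_users, d)   # used by the social influence module
        self.poi = nn.Embedding(n_pois, d)
        self.time = nn.Embedding(n_slots, d)
        self.cat = nn.Embedding(n_cats, d)

    def forward(self, poi_ids, slot_ids, cat_ids):
        # Inputs: (n,) index tensors for one check-in sequence.
        # Output E: (n, 3d), one record feature e_j per check-in.
        return torch.cat([self.poi(poi_ids), self.time(slot_ids), self.cat(cat_ids)], dim=-1)

    def poi_node_features(self, poi_ids, cat_ids):
        # Initial POI-graph node features h_j (Eq. 3), in R^{2d}.
        return torch.cat([self.poi(poi_ids), self.cat(cat_ids)], dim=-1)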

4.2 User Preference Tracking

We learn user preference features from both long- and short-term aspects. The self-attention mechanism attends to the entire sequence and is therefore used for long-term features, while the GRU focuses on temporal order and is therefore used for short-term features. Intuitively, long-term habits are only weakly periodic and may not feed directly into a user’s recent behavior; they remain relatively stable over time. Based on this, we use an attention mechanism to model long-term preference features.

For the sequence \(r_{u_{i}}\), we can obtain the long-term sequence \(r^{l}_{u_{i}}\) and the short-term sequence \(r^{s}_{u_{i}}\). For convenience, we select the same sequence (\(r^{s}_{u_{i}}\) = \(r^{l}_{u_{i}}\)) and use different models to construct long-term feature \(E^{l}_{u_{i}}\) and short-term feature \(E^{s}_{u_{i}}\).

$$\begin{aligned} E^{l}_{u_{i}} = \textrm{MultiModalEmbedding}(r^{l}_\mathrm{u_{i}}) \end{aligned}$$
(5)
$$\begin{aligned} E^{s}_{u_{i}} = \textrm{MultiModalEmbedding}(r^{s}_\mathrm{u_{i}}) \end{aligned}$$
(6)

MultiModalEmbedding\((\cdot )\) denotes the multimodal feature embedding module. We use the attention mechanism [30] to learn long-term POI transition relationships between non-adjacent check-ins in the sequence.

$$\begin{aligned} P^{l}_{i}&= SA(E^{l}_{u_{i}}) = \textrm{Attention}(W^{Q}_{i}E^{l}_{u_{i}}, W^{K}_{i}E^{l}_{u_{i}}, W^{V}_{i}E^{l}_{u_{i}}) \end{aligned}$$
(7)
$$\begin{aligned} P^{l}&= \textrm{Concat}(P^{l}_{1}, P^{l}_{2}, \dots , P^{l}_{h})W = (P^{l}_{1} \oplus P^{l}_{2} \oplus \dots \oplus P^{l}_{h})W \end{aligned}$$
(8)

Furthermore, considering the complexity of long-term features, we use multi-head attention for better extraction and modeling of long-term preferences. W denotes a trainable projection matrix.

Unlike long-term historical preferences, short-term preferences show pronounced periodic and sequential characteristics, which makes them well suited to RNN-style models. We therefore model them with a GRU:

$$\begin{aligned} h^{s}_{t}, P^{s} = GRU(h^{s}_{0}, E^{s}_{u_{i}}) \end{aligned}$$
(9)

The details are shown in Fig. 9a and b. \(GRU(\cdot )\) denotes the GRU model. Together, the long-term and short-term preferences express the overall preferences of users.
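The two branches can be summarized in a few lines of PyTorch. As a sketch, we reuse nn.MultiheadAttention in place of Eqs. (7)-(8) and an off-the-shelf GRU for Eq. (9); the head count is an illustrative assumption.

import torch.nn as nn

class PreferenceTracker(nn.Module):
    # Long-term branch: multi-head self-attention over the whole sequence.
    # Short-term branch: a GRU over the same embedded sequence E in R^{n x 3d}.
    def __init__(self, d=100, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(3 * d, heads, batch_first=True)
        self.gru = nn.GRU(3 * d, 3 * d, batch_first=True)

    def forward(self, e_seq):
        # e_seq: (batch, n, 3d) output of the multimodal embedding module.
        p_long, _ = self.attn(e_seq, e_seq, e_seq)  # P^l, long-term preference
        p_short, _ = self.gru(e_seq)                # P^s, short-term preference
        return p_long, p_short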

Fig. 9 Detail of modules

4.3 Geographical Awareness

We propose a geographical-aware graph attention network (GA-GAT) to capture the distance relationships between POIs. Extending the idea of GAT [31], when computing the similarity between POI nodes we introduce the distance similarity \(\beta\) between nodes in addition to the similarity \(\alpha\) between their feature vectors h, as shown in Fig. 9c. The module is defined as follows:

$$\begin{aligned} \alpha _{ij}&= softmax(s_{ij}) = \frac{exp(a(W_{\alpha }h_{i}, W_{\alpha }h_{j}))}{\sum ^{}_{k \in {\mathcal {N}}_{i}} exp(a(W_{\alpha }h_{i}, W_{\alpha }h_{k}))} \end{aligned}$$
(10)
$$\begin{aligned} \beta _{ij}&= softmax(1 / d(i, j)) = \frac{exp(1 / d(i, j))}{\sum ^{}_{k \in {\mathcal {N}}_{i}} exp(1 / d(i, k))} \end{aligned}$$
(11)
$$\begin{aligned} h'_{i}&= \sigma \left(\sum ^{}_{j \in {\mathcal {N}}_{i}} \frac{\alpha _{ij} + \beta _{ij}}{2} W_{\beta }h_{j}\right) \end{aligned}$$
(12)

where \(\alpha _{ij}\) denotes the normalized attention coefficient between two POI nodes; \(\beta _{ij}\) denotes the normalized distance coefficient between two POI nodes; \(a(\cdot , \cdot )\) denotes the similarity calculation function; \(d(\cdot , \cdot )\) denotes the distance between two POI nodes, which can be calculated using the haversine function; and \({\mathcal {N}}_{i}\) denotes the neighboring POI nodes of POI \(l_{i}\).

Based on the information of each POI node in a sequence \(r_{u_{i}}\), a geospatial feature vector is constructed as \(H_{u_{i}} \in {\mathcal {R}}^{n \times 2*d}\).

$$\begin{aligned} H_{u_{i}} = \left[ h'_{1}, h'_{2}, \dots , h'_{n} \right] \end{aligned}$$
(13)
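One GA-GAT aggregation step for a single node can be sketched as follows, under Eqs. (10)-(12). We instantiate \(a(\cdot ,\cdot )\) as the standard GAT LeakyReLU scoring and \(\sigma\) as ReLU, both illustrative choices, and assume all neighbor distances are nonzero.

import torch
import torch.nn.functional as F

def ga_gat_node(h_i, h_nbrs, d_nbrs, W_a, W_b, a_vec):
    # h_i: (f,) features of POI node i; h_nbrs: (k, f) features of its neighbors;
    # d_nbrs: (k,) haversine distances d(i, j) in km; W_a, W_b: (f, f) weights;
    # a_vec: (2f,) attention vector, as in standard GAT. Names are illustrative.
    z_i, z_nbrs = h_i @ W_a, h_nbrs @ W_a
    s = F.leaky_relu(torch.cat([z_i.expand_as(z_nbrs), z_nbrs], dim=-1) @ a_vec,
                     negative_slope=0.2)
    alpha = torch.softmax(s, dim=0)            # feature similarity, Eq. (10)
    beta = torch.softmax(1.0 / d_nbrs, dim=0)  # distance similarity, Eq. (11)
    coef = (alpha + beta) / 2                  # fine-grained neighbor weights
    return torch.relu((coef.unsqueeze(-1) * (h_nbrs @ W_b)).sum(dim=0))  # Eq. (12)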

4.4 Social Influence

Traditional use of user information includes only the users themselves and rarely considers interactions between users. However, similar users tend to have similar preferences. Based on this, we construct social network relationships from user similarity in the dataset in order to explore the spread of interest among users.

To enhance user information features, we propose a random neighborhood sampling algorithm to analyze propagation relationships in the user social network. As in the geographical awareness module, users have multiple neighbors in the social network. Since randomly sampling several neighbors would lead to an uncertain sampling count, high model complexity and low efficiency, we sample only one neighbor \(u_{j}\) of user \(u_{i}\), balancing efficiency and performance. In addition, during each iteration we randomly select a POI \(l_{j}\) that \(u_{j}\) is interested in and update the features of user \(u_{i}\): \(e_{u_{i}} = e_{u_{i}} + e_{u_{j}} + e_{l_{j}}\), as shown in Fig. 9d.

When constructing features, we embed user-level information into a final feature vector of length n. If only the user’s own features were added at each position, the resulting vector would lack diversity. Therefore, at each position we combine the user’s own features, the features of a sampled neighbor and the features of a particular POI that the neighbor likes. This procedure is repeated n times, filling the corresponding positions of the final feature vector. This increases the randomness of user features and enriches the representation of the final feature vector.

After n iterations, the user social information feature vector is defined as \(E_{u_{i}} \in {\mathcal {R}}^{n \times d}\).

$$\begin{aligned} E_{u_{i}} = \left[ e^{1}_{u_{i}}, e^{2}_{u_{i}}, \dots , e^{n}_{u_{i}} \right] \end{aligned}$$
(14)
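The sampling loop itself is short; the following sketch assumes neighbor lists and per-user POI lists are available from \({\mathcal {G}}_{u}\) and the check-in data, with illustrative container formats.

import random
import torch

def social_features(u, user_emb, poi_emb, social_nbrs, user_pois, n=50):
    # At each of the n sequence positions, sample one social neighbor u_j and
    # one POI l_j that u_j visited, and form e_u + e_{u_j} + e_{l_j} (Fig. 9d).
    rows = []
    for _ in range(n):
        u_j = random.choice(social_nbrs[u])  # one neighbor per position
        l_j = random.choice(user_pois[u_j])  # one POI the neighbor likes
        rows.append(user_emb(torch.tensor(u))
                    + user_emb(torch.tensor(u_j))
                    + poi_emb(torch.tensor(l_j)))
    return torch.stack(rows)                 # E_{u_i} in R^{n x d}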

4.5 Candidate Set Prediction

The candidate set prediction module uses the contextual semantic information produced by all the modules above to compute the next candidate set for POI recommendation. From the input sequence \(r_{u_{i}}\) we obtain the long-term preference features \(P^{l}_{u_{i}}\) and short-term preference features \(P^{s}_{u_{i}}\), and fuse them into the final preference representation \(P_{u_{i}} = P^{l}_{u_{i}} \oplus P^{s}_{u_{i}}\).

In addition, the geospatial features \(H_{u_{i}}\) and the users’ social information features \(E_{u_{i}}\) are added to form the final recommendation contextual semantic feature vector \(F_{u_{i}} = P_{u_{i}} + H_{u_{i}} + E_{u_{i}}\), from which the probability distribution over the next-POI candidate set is computed as follows.

$$\begin{aligned} p = softmax(W_{p}(P_{u_{i}} + H_{u_{i}} + E_{u_{i}})) \end{aligned}$$
(15)

We select the Top-k highest scores from p as the recommendation list for the next POI.

4.6 Model Training

In this paper, we use the binary cross-entropy loss as the objective function, and we add a regularization term to the loss function to prevent overfitting during training. The loss function is defined as follows.

$$\begin{aligned} {\mathcal {J}} = \sum ^{|{\mathcal {U}}|}_{i=1}\sum ^{|{\mathcal {L}}|}_{j=1}\left( -log(\sigma (y^{poi}_{ij}))\right) + \mu \sum ^{|{\mathcal {U}}|}_{i=1}\sum ^{|{\mathcal {L}}|}_{j=1}\left( -log(\sigma (y^{cat}_{ij}))\right) + \lambda ||\theta || \end{aligned}$$
(16)

Our model predicts both the next POI and the next category. To exploit category information as much as possible, we include the category loss, weighted by \(\mu\) (see Sect. 5.1.4), in the objective.
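A sketch of a joint objective consistent with Eq. (16) and the \(\mu\) setting of Sect. 5.1.4 is given below; writing the two terms with cross-entropy over logits and an explicit L2 penalty is our reading, not necessarily the authors’ exact code.

import torch.nn.functional as F

def sstp_loss(poi_logits, cat_logits, poi_tgt, cat_tgt, params, mu=0.3, lam=1e-5):
    # Next-POI loss plus the category loss weighted by mu, plus L2 regularization.
    # lam is an illustrative value for the regularization coefficient lambda.
    loss = F.cross_entropy(poi_logits, poi_tgt) + mu * F.cross_entropy(cat_logits, cat_tgt)
    return loss + lam * sum(p.norm(2) for p in params)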

5 Experiments

In this section, we evaluate the performance of our model using two real-world datasets and compare it with other current models in related work. We first introduce the comparison datasets, evaluation metrics, baselines and hyper-parameter settings. Next, we present the results of our experiments. Finally, we perform additional experimental analysis.

5.1 Experimental Setup

5.1.1 Datasets and Data Preprocessing

We evaluate the models on two public real-world datasets, NYC and TKY, which contain approximately 10 months of user check-in data (April 2012 to February 2013) collected by Foursquare. Each check-in record includes a user ID, a POI ID, a category ID, a time stamp, and the latitude and longitude of the POI. To minimize interference from noisy data, we remove users with fewer than 10 check-in records and POIs with fewer than 10 visits. The preprocessed data statistics are shown in Table 2. A user check-in sequence of length n is split into three parts: the training set consists of the first \(n - 2\) check-in records, while the validation set and test set consist of the first \(n - 1\) and all n check-in records, respectively.
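For one user, the split can be expressed directly; the tuple layout below is illustrative.

def split_sequence(seq):
    # For a length-n check-in sequence: the first n-2 records train the model,
    # the (n-1)-th record is the validation target, the n-th is the test target.
    train = seq[:-2]
    valid = (seq[:-2], seq[-2])  # (input, target) for validation
    test = (seq[:-1], seq[-1])   # (input, target) for test
    return train, valid, test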

Table 2 Statistics of datasets

5.1.2 Baselines

To demonstrate the effectiveness of our SSTP, we compare it with the following baselines, which have the next POI recommendation capability:

  • ST-RNN [22] is an extended RNN model, which incorporates continuous time and distance interval information in the check-in sequence into the gated input of RNN.

  • SAE-NAD [32] uses a self-attentive encoder and a neighbor-aware decoder to learn users’ preferences and POI’s neighbor information.

  • SASRec [24] uses a self-attention layer to learn the latent preferences of users in the check-in sequence.

  • DeepMove [33] uses the attention mechanism and RNN module to learn users’ long- and short-term preferences.

  • Caser [21] uses convolutional filters to model the spatiotemporal features of check-in sequences treated as 2D images.

  • TiSASRec [34] extends SASRec by considering the time intervals between consecutive check-in behaviors and proposes a time-aware attention mechanism.

  • PLSPL [35] uses two LSTM modules to learn users’ POI and category transfer relations while using an attention module to obtain users’ preference information.

  • STAN [25] constructs temporal and spatial Euclidean mapping spaces from sequence separately and uses a self-attentive layer to obtain POI transfer relations.

  • LightMove [36] designs a model based on a neural ordinary differential equation to model long-term preference and uses an attention module to learn short-term preference.

5.1.3 Evaluation Metrics

We adopt Hit@k and MAP@k to evaluate the performance of different models. Hit@k indicates how many correct samples from the test set appear in the top-k predictions and measures the accuracy of the model. MAP@k indicates how highly the correct samples are ranked in the recommendation list and measures ranking quality; intuitively, the higher the ranking, the better the performance. Their formulas are defined as follows:

$$\begin{aligned} Hit@k&= \frac{1}{\left| {\mathcal {U}} \right| } \sum ^{}_{u \in {\mathcal {U}}} \frac{\left| {\mathcal {L}}^{k}_{u} \cap {\mathcal {L}}^{T}_{u} \right| }{\left| {\mathcal {L}}^{T}_{u} \right| } \end{aligned}$$
(17)
$$\begin{aligned} MAP@k&= \frac{1}{\left| {\mathcal {U}} \right| } \sum ^{}_{u \in {\mathcal {U}}} \frac{1}{{\mathcal {P}}^{k}_{u}} \end{aligned}$$
(18)

where \(k = \{1, 5, 10, 20\}\) denotes the length of the model’s recommendation POI list; \({\mathcal {U}}\) denotes the set of users in the test set; \({\mathcal {L}}^{k}_{u}\) denotes the POI recommendation candidates set of length k generated by the model for user u; \({\mathcal {L}}^{T}_{u}\) denotes the set of POIs for user u in the test set T; and \({\mathcal {P}}^{k}_{u}\) denotes the position of POI in recommendation candidates list.
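For a single test user with one ground-truth next POI, the two metrics reduce to the following sketch; the per-user values are then averaged over \({\mathcal {U}}\) as in Eqs. (17)-(18). Names are illustrative.

def hit_at_k(ranked, target, k):
    # 1 if the ground-truth POI appears in the top-k recommendations, else 0.
    return int(target in ranked[:k])

def map_at_k(ranked, target, k):
    # Reciprocal of the target's 1-based rank if it is in the top-k, else 0.
    return 1.0 / (ranked[:k].index(target) + 1) if target in ranked[:k] else 0.0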

5.1.4 Hyper-parameter Settings

The hyper-parameters of the SSTP model are as follows:

  • The embedding dimension dim is 100.

  • The maximum check-in sequence input length len is 50.

  • The initial learning rate lr is 0.001.

  • The number of time intervals timeslot is 24.

  • The distance threshold for POI adjacency is 3 km.

  • The minimum similarity threshold sim is 0.4.

  • The ratio of category loss value \(\mu\) is 0.3.

  • The optimizer we use is Adam.

5.2 Comparison of Experimental Results and Analysis

To ensure accuracy and fairness, we ran each model five times, recorded the best result of each run, and report the average of these results as the model’s performance.

Table 3 compares the recommendation performance of our model and the baselines on the two datasets. The results show that SSTP outperforms the baseline models on both datasets across all evaluation metrics, except for Hit@10 on the NYC dataset, where it is slightly below DeepMove. Taking Hit@5 and MAP@5 as examples, SSTP improves over the best baseline by 3.28\(\%\) and 5.73\(\%\) respectively on NYC, and by 4.41\(\%\) and 5.44\(\%\) respectively on TKY.

Table 3 Result of SSTP and baselines (%)

In Table 3, the baselines fall into three tiers of performance, taking Hit@1 as an example: low (ST-RNN, SAE-NAD), medium (STAN, LightMove) and high (SASRec, DeepMove, Caser, TiSASRec, PLSPL).

ST-RNN incorporates temporal and spatial information from the input sequence into the gating mechanism of the RNN. However, with sparse data, the representations produced by its matrix operations still struggle to express the current contextual semantics. SAE-NAD also performs a large number of matrix operations, and its autoencoder architecture implies a loss of feature information, which explains the low effectiveness of both approaches.

STAN models time and distance intervals, but its implementation repeatedly lifts and reshapes features, which weakens their expressive capability. Although the neural ordinary differential equation in LightMove is essentially an RNN, its input is a short sequence of sessions, which fragments feature learning. Between the two, LightMove outperforms STAN because it models both long-term and short-term preferences and handles them better.

SASRec, DeepMove, Caser, TiSASRec and PLSPL all perform well. However, their drawbacks are also evident: some ignore the spatiotemporal semantics of POIs, while others consider only space and ignore time, and thus fail to exploit all the modal information. SSTP improves on the Hit@k metrics because it builds dedicated modules around the hierarchical characteristics of users’ preferences and combines spatiotemporal contextual information with users’ social network information, fully exploiting the auxiliary information of POIs.

It is worth noting that for consecutive values of k, the improvement for the same model shrinks. For example, MAP@5 and MAP@10 of SSTP on NYC differ by only 0.68\(\%\). The reason is that as the recommendation list length k increases, the predicted position of the next correct POI moves toward the end of the list. As shown in Eq. 18, the further back the position, the smaller its MAP@k contribution, so growth slows.

5.3 Hyper-Parameters Sensitivity Analyses

We explore the impact of different hyper-parameter values on the performance of the SSTP model, and two important parameters are selected in this section: (1) dimension of Embedding feature vector \({\textbf {dim}}\) and (2) maximum length of user check-in sequence \({\textbf {len}}\).

The results for these two parameters are shown in Fig. 10. As Fig. 10a and b show, SSTP performs worst when \(dim=10\) because the dimension is too small to learn expressive feature vectors, which prevents the model from effectively learning and distinguishing the current contextual information. As the dimensionality increases, performance gradually improves and levels off, peaking at \(dim=100\). The check-in sequence length \({\textbf {len}}\) determines how much information is fed to the model, and Fig. 10d shows that performance increases with len because longer sequences contain more information. Although a longer len is better, we set it to 50 because the comparison baselines use a maximum input length of 50, and we make this choice for fairness.

Fig. 10 Impact of hyper-parameters

To examine the effect of the category loss on model performance, we conducted experiments with values of \(\mu\) ranging from 0.0 to 1.0. The results in Table 4 demonstrate that introducing the category loss improves the model’s performance for most values of \(\mu\). Notably, the results are nearly optimal at \(\mu = 0.3\), confirming the effectiveness of the category loss and identifying 0.3 as the optimal value.

Table 4 Impact of category loss in experiments (%)

5.4 Ablation and Unit Study

To analyze the roles and contributions of the different modules, we conduct an ablation study and unit module studies in this section. Removing one module of SSTP at a time yields four degraded models: SGSP (without long-term preference), SGLP (without short-term preference), SLSP (without geographical influence) and GLSP (without social influence). Likewise, each module can serve as a standalone model: OLP (only long-term preference), OShP (only short-term preference), OGP (only geographical influence) and OSoP (only social influence). For brevity, this experiment is conducted only on the NYC dataset. The ablation results are shown in Table 5, and the unit module results in Fig. 11.

Fig. 11 Result of unit module study

OGP performs worst among all the unit modules, since it learns only the geographical distance relations between POIs and obtains no information about the user’s interest preferences. OShP and OLP learn user preferences from the short-term and long-term aspects respectively, and their performance is comparable. OSoP does not perform as well as OShP and OLP but still reaches a good level, indicating that preference features can be learned through social propagation. In particular, OSoP and OGP are very close at \(k=1\), and as k increases, OSoP improves rapidly, which demonstrates its effectiveness.

Table 5 shows that GLSP performs worst among the ablated models, which indicates that adding the social influence module brings a significant improvement to SSTP: the propagation mechanism makes users’ preference features multi-level, again verifying the effectiveness of the social module.

SGSP and SGLP eliminate long-term and short-term preferences respectively, and SGSP performs better, which indicates that short-term preferences matter more than long-term ones. This is probably because short-term preferences better reflect users’ recent behavior and thus support more accurate recommendations. By contrast, the geographical module only reflects distance relations, not preference changes.

Table 5 Ablation study result of SSTP (%)

5.5 Sparsity Study

To explore the performance of SSTP under different levels of sparsity, we conduct a data sparsity study in this section. We build a dataset named \(NYC_{sub}\) by selecting about 50% of the user check-in data from NYC. The statistics of these datasets are shown in Table 6, and the experimental results in Table 7.

Table 6 Dataset statistics of sparsity dataset
Table 7 Comparison between NYC and \(NYC_{sub}\) (%)

The baselines chosen here achieve results close to each other. Table 7 shows that SSTP improves over the baselines by much more on the sparser \(NYC_{sub}\) (0.9890%) than on the denser NYC (0.6409%), which shows that SSTP copes better with sparse datasets. This is because SSTP makes full use of contextual information, such as categories and social relations, and integrates it into a unified framework.

6 Conclusion and Future work

In this paper, we propose an SSTP model that integrates multiple factors in the next POI recommendation service.

To capture the hierarchical characteristics of user check-in preferences, we use a self-attention mechanism and a GRU module to learn long-term and short-term preference features, respectively. To mitigate the effect of data sparsity, we expand the content of the check-in data and construct high-dimensional semantic graph structures. We propose a geographical-aware graph attention network that models distance alongside feature similarity, and we use random neighbor sampling to augment the representation of user information. Our extensive experiments validate the effectiveness and utility of SSTP.

In future work, we plan to improve SSTP by incorporating techniques more advanced than the base models (self-attention and GRU) used here to model user preferences from check-in sequences. We also plan to exploit the current time information available in practical scenarios to augment our modeling whenever possible. Finally, the time complexity of the algorithm is a critical concern, and we will focus on optimizing it to improve the model’s efficiency.