1 Introduction

A recommendation system is an effective tool for addressing information overload: it helps users find information that matches their preferences and allows online platforms to publicize the information they produce. Traditional recommendation systems, such as content-based and collaborative filtering systems, focus on the user’s static historical interactions and assume that all user–item interactions in the historical sequence are equally important. In contrast, sequential recommendation (SR) systems aim to capture the dynamic preferences of users from their historical interactions. SR is therefore more consistent with real-world situations and can obtain more accurate results by taking more information into account.

Generally, conventional SR methods are classified into traditional machine learning (ML) approaches and deep learning (DL)-based approaches. The traditional ML approaches used to model sequence data mainly include k-means, Markov chains, and matrix decomposition [1,2,3,4]. In particular, matrix factorization (MF) can obtain users’ long-term preferences from various sequences [5]. More recently, SR models based on DL, mainly convolutional neural networks (CNNs) [6] and recurrent neural networks (RNNs) [7], have achieved relatively good results in practice owing to their strong performance in information retrieval and filtering. This is because DL-based SR approaches can utilize a much longer sequence, which helps capture the semantic information of the whole sequence [8]. In addition, DL-based SR approaches achieve much better performance on sparse data and on sequences of different lengths.

However, the above DL-based SR approaches still face several challenges. A major disadvantage is that they consider only users’ short-term interests while ignoring their long-term preferences. Moreover, they do not consider the time information in the original interaction sequence, which could help extract various patterns through different forms of temporal embedding. Therefore, this paper proposes a deep learning-based sequential recommendation system. It utilizes both the global preference and the local preference of users, uses multi-temporal embeddings to encode both the absolute and the relative positions of user–item interactions, and simulates the real-time recommendation environment.

This paper proposes a novel multi-temporal sequential recommendation model based on fused learning preferences, named MTSR-FLP. The proposed MTSR-FLP model consists of four parts: multi-temporal embedding learning, local representation learning, global representation learning, and an item similarity gating module. First, the multi-temporal embedding learning module is the most important and innovative part of the proposed model. It is designed to handle timestamp information by constructing multiple temporal embeddings, which are sent to the local representation learning (LRL) module. Second, the LRL module is built on the framework of the SASRec model [9]; it weights and aggregates the multiple temporal embeddings to obtain the local preference of users. Third, in the global representation learning (GRL) module, a novel location-based attention layer is developed to obtain the final global representation of all users. Finally, the proposed model includes a novel item similarity gating module, which coordinates the local preference of the specific user and the global preference of all users. It also considers the influence of candidate items by calculating the similarity between candidate items and recently interacted items.

The main contributions of the proposed MTSR-FLP model are as follows: (1) the proposed model develops a multiple temporal embedding scheme to encode the positions of user–item interactions in a sequence; (2) to capture the local representation of a user from their historical information, the MTSR-FLP model proposes a local representation learning module; (3) to effectively capture the user’s global preference, the proposed model designs a global representation learning module; (4) in comparisons with four traditional models on five public datasets, experimental results demonstrate that the proposed model performs better in terms of the effectiveness and feasibility of sequential recommendation.

The rest of this paper is organized as follows. Section 2 briefly introduces recent developments in DL-based sequential recommendation and temporal recommendation. Section 3 describes the details of the proposed model and its training procedure. Experimental results on five public datasets and comparisons with existing models are analyzed in Sect. 4. Finally, Sect. 5 concludes the paper and gives directions for future work.

2 Related Work

2.1 DL-Based Sequential Recommendation

Recently, many researchers have utilized DL-based sequential recommendation (SR) models to obtain nonlinear and dynamic features. In particular, RNN-based models were among the first to be incorporated into sequential modeling, since they naturally model sequences step by step [10, 11]. Moreover, the GRU4Rec model [12] was proposed to effectively extract the position information between adjacent sequences, and the GRU4Rec+ model [13] adopts various sampling strategies and loss functions. However, these models depend heavily on the user’s recent interactions, which weakens training efficiency. Afterward, some researchers utilized CNN-based models to address the vanishing gradient issue of RNN-based SR models. For example, the Caser model [5] learns sequence patterns by applying convolutional filters to local sequences of dense user behavior data. A simple convolutional generative network has also been proposed to learn high-level representations from both short- and long-range item dependencies [14].

Since attention mechanisms have an advantage in their hierarchical structure, some DL-based models utilize them to solve sequential recommendation problems. For example, the well-known Transformer model [15], which relies entirely on attention, computes representations of its input and output without RNNs or CNNs, which makes it more parallelizable and faster to train while achieving superior translation quality. However, the Transformer model does not directly utilize the position information in the sequence and cannot extract local features [16]. In addition, some SR models have been developed on the self-attention mechanism. For example, SASRec [9] can both uncover the user’s long-term preference and predict the next recommended item mainly based on the recently interacted items. Bert4Rec [17] captures the correlation of items in both left-to-right and right-to-left directions to obtain a better representation of the user sequence. LSSA [18] deals with both long-term preferences and sequential dynamics using a long- and short-term self-attention network. However, these models are limited to capturing the sequential patterns between consecutive interacted items and ignore the complex relationships among high-order interactions.

Recently, graph neural networks (GNNs) have been widely utilized to model high-order transformation relationships in the area of SR. For example, NGCF [19] and LightGCN [20] argue that graph-based models can effectively deal with collaborative data and are critical in presenting user/item embedding. SRGNN [21] converts the sequence data into graph-structured data with the gated GNNs to perform message propagation on graphs. GCE-GNN [22] utilizes a GNN model for the session graph to learn item embedding with a session-aware attention mechanism. GC–MC [23] proposes a graph convolutional matrix completion (GC–MC) framework from the perspective of link prediction on graphs. However, the GC–MC framework only models the user’s direct scoring of items in a convolutional layer. CTDNE [24] defines a temporal graph to learn the dynamic embedding of nodes. However, most of these GNN-based models only capture sequential transition patterns based on items’ locations and identities, while ignoring the influence of contextual features such as time intervals, which will cause the models to fail to learn appropriate sequence representations.

Inspired by SASRec [9], which can both uncover the user’s long-term preference and predict the next recommended item mainly based on the recently interacted items, the proposed model uses SASRec as the framework of the local representation learning layer. Different from SASRec, the proposed local representation learning module weights and aggregates the multiple temporal embeddings to better capture the local preference of users. In addition, this paper does not utilize GNNs, so the proposed model is at a disadvantage in modeling high-order transformation relationships in some areas of SR.

2.2 Temporal Recommendation

Since user preferences are dynamic in SR problems, it is essential to capture temporal information. Originally, matrix factorization (MF) was often utilized in temporal recommendation systems owing to its relatively high accuracy and scalability [25]. However, it is difficult to predict ratings for new users with MF. Collaborative filtering (CF)-based approaches have also proven successful in modeling temporal information [26]. However, some linear approaches are unsuitable for the more complex dynamics of temporal changes, and the transition matrix is usually homogeneous for all users [27]. In addition, such approaches are impractical for large datasets because they are time-consuming.

Afterward, deep learning-based models were introduced into temporal recommendation systems; they encode temporal interval signals into the user preference representations to predict the user’s future actions. For example, Time-LSTM [28] combined the LSTM model with several time gates to better model the time intervals in users’ interaction sequences. Zhang et al. [29] introduced a self-attention mechanism into a sequence-aware recommendation model to represent the user’s temporal interest. The CTA model [30] utilized multiple parameterized kernel functions to incorporate temporal information into an attention mechanism [31]. Tang and Wang [10] combined a convolutional sequence embedding model with Top-N sequential recommendation for temporal recommendation. TiSASRec [32] models both the absolute positions of items and the time intervals between them in a sequence. Moreover, TASER [33] proposes a temporal-aware network based on absolute and relative time patterns. KDA [34] utilizes temporal decay functions of various relations between items to explore the time drift of their relational interactions.

In addition, some GNN-based temporal SR models have been proposed recently. For example, the RetaGNN model [35] improves the inductive and transferable performance of SR by encoding the sub-graphs derived from user–item pairs. Moreover, DSPR [36] proposes a novel sequential POI recommendation approach to capture the real-time impact and absolute time. DRL-SRe [37] obtains dynamic preferences from the global user–item graph of each time slice and deploys an auxiliary temporal prediction task over consecutive time slices. TARec [38] captures the user’s dynamic preferences by combining an adaptive search-based module with a time-aware module.

Inspired by the TiSASRec model [32], this paper utilizes multi-scale temporal embeddings based on absolute and relative time patterns to comprehensively capture the diverse patterns of users’ behavior. In addition, self-attention operations and global preference learning are integrated with this time-aware information in the proposed model to improve the performance of SR.

3 The MTSR-FLP Model

This section presents the MTSR-FLP model, which fuses the global and the local preferences of users with self-attention networks based on multi-temporal embeddings for sequential recommendation. Without loss of generality, the paper considers a recommendation system with implicit feedback given by a set of users to a set of items. The entire recommendation process can be summarized as follows.

  • Multiple temporal embeddings

    • Convert \(\mathrm{T}=\left\{{\mathrm{t}}_{1};{\mathrm{t}}_{2};\dots ;{\mathrm{t}}_{\mathrm{L}}\right\}\in {\mathbb{R}}^{\left|\mathrm{I}\right|}\) into the absolute embedding by combining \({\mathrm{E}}^{\mathrm{Day}}\in {\mathbb{R}}^{\left|\mathrm{I}\right|\times \mathrm{d}}\) and \({\mathrm{E}}^{\mathrm{Pos}}\in {\mathbb{R}}^{\left|\mathrm{I}\right|\times \mathrm{d}}\), which contain the time information and the index of each item in the sequence, respectively.

    • Convert \(\mathrm{T}=\left\{{\mathrm{t}}_{1};{\mathrm{t}}_{2};\dots ;{\mathrm{t}}_{\mathrm{L}}\right\}\in {\mathbb{R}}^{\left|\mathrm{I}\right|}\) into the relative embeddings \({\mathrm{E}}^{\mathrm{Sin}}\in {\mathbb{R}}^{\left|\mathrm{I}\right|\times \left|\mathrm{I}\right|\times \mathrm{d}}\), \({\mathrm{E}}^{\mathrm{Exp}}\in {\mathbb{R}}^{\left|\mathrm{I}\right|\times \left|\mathrm{I}\right|\times \mathrm{d}}\), and \({\mathrm{E}}^{\mathrm{Log}}\in {\mathbb{R}}^{\left|\mathrm{I}\right|\times \left|\mathrm{I}\right|\times \mathrm{d}}\), which capture different aspects of the temporal information.

  • Local learning representation

    • The input matrix \({\mathrm{X}}^{(0)}=\{{\mathrm{x}}_{1};{\mathrm{x}}_{2};\dots ;{\mathrm{x}}_{\mathrm{L}}\}\in {\mathbb{R}}^{\left|\mathrm{I}\right|\times \mathrm{d}}\) is obtained by adding the absolute position embedding matrix \({\mathrm{E}}^{\mathrm{Abs}}=\{{\mathrm{a}}_{1};{\mathrm{a}}_{2};\dots ;{\mathrm{a}}_{\mathrm{L}}\}\in {\mathbb{R}}^{\left|\mathrm{I}\right|\times \mathrm{d}}\) to the item embedding matrix \(\mathrm{E}\in {\mathbb{R}}^{|\mathrm{I}|\times \mathrm{d}}\).

    • Input the matrix \({\mathrm{X}}^{(0)}\) into stacked self-attentive blocks (SAB) to get the output of the \({\mathrm{b}}^{\mathrm{th}}\) block as \({\mathrm{X}}^{\left(\mathrm{b}\right)}={\mathrm{SAB}}^{\left(\mathrm{b}\right)}\left({\mathrm{X}}^{\left(\mathrm{b}-1\right)}\right),\ \mathrm{b}\in \left\{1,2,\dots ,\mathrm{B}\right\}\).

  • Global learning representation

    • The global representation of the sequences is formalized as \(\mathrm{y}=\mathrm{LBA}\left(\mathrm{E}\right)=\mathrm{softmax}\left({\mathrm{q}}^{\mathrm{S}}{\left(\mathrm{EW}{\mathrm{^{\prime}}}_{\mathrm{K}}\right)}^{\mathrm{T}}\right)\mathrm{EW}{\mathrm{^{\prime}}}_{\mathrm{V}}\).

  • Item similarity gating

    • The specific computing equation is formalized as

      $$\mathrm{g}=\upsigma \left(\left[{\mathrm{m}}_{\mathrm{sl}},\mathrm{y},{\mathrm{m}}_{\mathrm{i}}\right]{\mathrm{W}}_{\mathrm{G}}+{\mathrm{b}}_{\mathrm{G}}\right),$$

where \(\mathrm{y}\) represents the global representation, \({\mathrm{m}}_{\mathrm{sl}}\) is the embedding vector of the most recently interacted item, and \({\mathrm{m}}_{\mathrm{i}}\) is the embedding vector of the candidate item. \({\mathrm{W}}_{\mathrm{G}}\) and \({\mathrm{b}}_{\mathrm{G}}\) are the weight and the bias, respectively.

3.1 MTSR-FLP Model Overview

As shown in Fig. 1, the MTSR-FLP model mainly consists of four layers as follows: multi-temporal embedding learning; local representation learning; global representation learning; item similarity gating. Moreover, the entire algorithm of the MTSR-FLP model is described in Algorithm 1. The roles of the four components of the proposed model are described as follows:

Fig. 1 The framework of the MTSR-FLP model

Algorithm 1 (figure a)

First, the multi-temporal embedding learning layer makes full use of the user’s temporal information by constructing multiple temporal embeddings. These temporal embeddings fall into two forms of embedding vectors according to how they are calculated. The paper then utilizes a gate mechanism to integrate one form into the absolute embeddings and the other form into the relative embeddings.

Second, the local representation learning layer mainly utilizes the self-attention mechanism, the absolute embeddings, and the relative embeddings learned in the multi-temporal embedding learning module to learn the local representation of the specific user.

Third, the global representation learning layer utilizes all users’ historical sequences to capture the global preference of all users with an attention mechanism modified from the self-attention mechanism.

Finally, the item similarity gating layer balances the global representation and the local representation using a gate mechanism.

Table 1 lists the notation used in Fig. 1.

Table 1 Notation

3.2 Multi-temporal Embedding Learning Module

To address timestamp information, this module makes the following improvements:

  1. The construction of multiple temporal embeddings: \({E}^{Exp}\), \({E}^{Log}\), \({E}^{Sin}\), \({E}^{Day}\), and \({E}^{Pos}\).

  2. The construction of absolute embedding by combining \({E}^{Day}\) and \({E}^{Pos}\).

  3. The construction of relative embedding by combining \({E}^{Exp}\), \({E}^{Log}\), and \({E}^{Sin}\).

3.2.1 Time Series Embedding

In the user’s historical sequence, each item is clicked by the user at a specific time, so time information is important for recommendation.

The paper converts \(T= \left\{{t}_{1};{t}_{2};\dots ;{t}_{L}\right\}\in {\mathbb{R}}^{\left|I\right|}\) into \({E}^{Day}\in {\mathbb{R}}^{\left|I\right|\times d}\) and \({E}^{Pos} \in {\mathbb{R}}^{\left|I\right|\times d}\), where \(d\) denotes the hidden vector dimension. \({E}^{Pos}\) contains the index of each item in a sequence, and \({E}^{Day}\) contains the time information of each item in a sequence. Compared with \({E}^{Timestamp}\), \({E}^{Day}\) is more reasonable: by using one day as the time interval, it saves storage space relative to \({E}^{Timestamp}\) while still exploiting the time information of users’ historical sequences.

In addition, the paper maintains a learnable embedding matrix \({M}^{D}\in {\mathbb{R}}^{\left|D\right|\times d}\), where \(|D|\) is the time span, measured in days, between the earliest clicked item and the latest clicked item in users’ historical sequences. The position embedding is also learnable, with \({M}^{P}\in {\mathbb{R}}^{|l|\times d}\), where \(|l|\) represents the length of the sequence. This paper sets \(|l|\) to 50.

The paper also converts \(\mathrm{T}= \{{\mathrm{t}}_{1};{\mathrm{t}}_{2};\dots ;{\mathrm{t}}_{\mathrm{L}}\}\in {\mathbb{R}}^{\left|\mathrm{I}\right|}\) into \({E}^{Sin}\in {\mathbb{R}}^{\left|\mathrm{I}\right|\times \left|\mathrm{I}\right|\times \mathrm{d}}\), \({E}^{Exp} \in {\mathbb{R}}^{\left|\mathrm{I}\right|\times \left|\mathrm{I}\right|\times \mathrm{d}}\), and \({E}^{Log} \in {\mathbb{R}}^{\left|\mathrm{I}\right|\times \left|\mathrm{I}\right|\times \mathrm{d}}\). These different time embeddings capture different aspects of the temporal information. First, the paper defines a temporal difference matrix \(\mathrm{D}\in {\mathbb{R}}^{\mathrm{N}\times \mathrm{N}}\), whose elements are defined as \({\mathrm{d}}_{\mathrm{ab}}=\frac{{\mathrm{t}}_{\mathrm{a}}-{\mathrm{t}}_{\mathrm{b}}}{\tau }\), where \(\tau\) is the adjustable unit time difference. For \({\mathrm{E}}^{\mathrm{Sin}}\), the paper converts the difference \({\mathrm{d}}_{\mathrm{ab}}\) to a latent vector \({\overrightarrow{\uptheta }}_{\mathrm{ab}}\in {\mathbb{R}}^{1\times \mathrm{h}}\), as shown in Eq. (1):

$$\begin{array}{c}{\overrightarrow{\uptheta }}_{\mathrm{ab},2\mathrm{c}}=\mathrm{sin}\left(\frac{{\mathrm{d}}_{\mathrm{ab}}}{{\mathrm{freq}}^{\frac{2\mathrm{c}}{\mathrm{h}}}}\right),\qquad {\overrightarrow{\uptheta }}_{\mathrm{ab},2\mathrm{c}+1}=\mathrm{cos}\left(\frac{{\mathrm{d}}_{\mathrm{ab}}}{{\mathrm{freq}}^{\frac{2\mathrm{c}}{\mathrm{h}}}}\right)\end{array}$$
(1)

where \({\overrightarrow{\theta }}_{\mathrm{ab},\mathrm{c}}\) is the \({\mathrm{c}}^{\mathrm{th}}\) value of the vector \({\overrightarrow{\theta }}_{\mathrm{ab}}\), and \(\mathrm{freq}\) and \(\mathrm{h}\) are adjustable parameters. Similarly, for \({\mathrm{E}}^{\mathrm{Exp}}\) and \({\mathrm{E}}^{\mathrm{Log}}\), the paper converts \({\mathrm{d}}_{\mathrm{ab}}\) to \({\overrightarrow{\mathrm{e}}}_{\mathrm{ab}}\) and \({\overrightarrow{\mathrm{l}}}_{\mathrm{ab}}\), respectively, and stacking these vectors gives \({\mathrm{E}}^{\mathrm{Exp}}\) and \({\mathrm{E}}^{\mathrm{Log}}\), as shown in Eq. (2):

$$\begin{array}{c}{\overrightarrow{\mathrm{e}}}_{\mathrm{ab},\mathrm{c}}=\mathrm{exp}\left(\frac{-\left|{\mathrm{d}}_{\mathrm{ab}}\right|}{{\mathrm{freq}}^{\frac{\mathrm{c}}{\mathrm{h}}}}\right),\qquad {\overrightarrow{\mathrm{l}}}_{\mathrm{ab},\mathrm{c}}=\mathrm{log}\left(1+\frac{\left|{\mathrm{d}}_{\mathrm{ab}}\right|}{{\mathrm{freq}}^{\frac{\mathrm{c}}{\mathrm{h}}}}\right)\end{array}$$
(2)
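As a minimal sketch of Eqs. (1) and (2), the following plain-Python snippet builds the three pairwise feature tensors from a list of timestamps. The function name `relative_time_features`, the default values of `tau`, `freq`, and `h`, and the zero-based index `c` are illustrative assumptions rather than settings taken from the paper.

```python
import math

def relative_time_features(timestamps, tau=86400.0, freq=10000.0, h=4):
    """Return (sin_emb, exp_emb, log_emb), each an L x L x h nested list.

    tau, freq and h are hypothetical defaults chosen for illustration only.
    """
    if h % 2 != 0:
        raise ValueError("h must be even so that sin/cos pairs fill the vector")
    L = len(timestamps)
    sin_emb = [[[0.0] * h for _ in range(L)] for _ in range(L)]
    exp_emb = [[[0.0] * h for _ in range(L)] for _ in range(L)]
    log_emb = [[[0.0] * h for _ in range(L)] for _ in range(L)]
    for a in range(L):
        for b in range(L):
            d_ab = (timestamps[a] - timestamps[b]) / tau  # scaled time difference
            for c in range(h // 2):  # Eq. (1): interleaved sin/cos components
                scale = freq ** (2 * c / h)
                sin_emb[a][b][2 * c] = math.sin(d_ab / scale)
                sin_emb[a][b][2 * c + 1] = math.cos(d_ab / scale)
            for c in range(h):  # Eq. (2): exponential decay and log growth
                scale = freq ** (c / h)
                exp_emb[a][b][c] = math.exp(-abs(d_ab) / scale)
                log_emb[a][b][c] = math.log(1.0 + abs(d_ab) / scale)
    return sin_emb, exp_emb, log_emb
```

With timestamps in seconds and `tau` set to one day, `d_ab` is the gap in days; the exponential features decay toward 0 and the logarithmic features grow slowly as the gap widens, while the sinusoidal features keep the sign of the gap.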

3.2.2 Absolute and Relative Embeddings Calculation

The calculation of absolute embeddings and relative embeddings is shown in Fig. 2. Here, \({E}^{Day}\) and \({E}^{Pos}\) are not simply added to obtain the absolute embeddings; instead, the paper calculates a weight for \({E}^{Day}\) and \({E}^{Pos}\) and then sums them according to that weight. The weight is calculated as in Eq. (3):

$$\begin{array}{c}\alpha =\sigma \left(\left[{\mathrm{E}}^{\mathrm{Day}},{\mathrm{E}}^{\mathrm{Pos}}\right]{\mathrm{W}}_{\mathrm{day}\_\mathrm{pos}}+{\mathrm{b}}_{\mathrm{day}\_\mathrm{pos}}\right)\end{array}$$
(3)

where \([.,.]\) denotes the concatenation operation, \({W}_{day\_pos}\in {\mathbb{R}}^{2d\times 1}\) and \({b}_{day\_pos} \in {\mathbb{R}}^{1\times 1}\). To restrict each element of \(\alpha\) to the range from 0 to 1, the paper utilizes the sigmoid activation function \(\sigma \left(z\right)=1/(1+{e}^{-z})\). Then, the paper combines \({E}^{Day}\) and \({E}^{Pos}\), weighted by \(\alpha\), to obtain \({E}^{Abs}\). The calculation process, which mixes the day embedding vector and the position embedding vector of the \(\ell\)-th item, is formalized in Eq. (4):

Fig. 2 Absolute and relative embeddings

$$a_{s\ell} = \alpha_{s} \otimes d_{s\ell} + (1 - \alpha_{s}) \otimes p_{s\ell},\quad \ell \in \left\{1,2,\ldots,L\right\}$$
(4)
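The gating in Eqs. (3) and (4) can be sketched in plain Python as follows, assuming that the gate mixes the day embedding and the position embedding row by row with one scalar weight per position. The function name `gated_fusion` and the argument layout are hypothetical; `weight` stands in for \({W}_{day\_pos}\) flattened to a list of length \(2d\).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(first, second, weight, bias=0.0):
    """Mix two L x d embedding matrices with a per-row scalar gate.

    Illustrative sketch: `first` plays the role of E^Day, `second` of E^Pos,
    and `weight` (length 2*d) the role of W in Eq. (3).
    """
    fused = []
    for u, v in zip(first, second):
        concat = list(u) + list(v)  # [E^Day, E^Pos] for one position
        gate = sigmoid(sum(w * x for w, x in zip(weight, concat)) + bias)
        # Eq. (4): convex combination of the two rows
        fused.append([gate * ui + (1.0 - gate) * vi for ui, vi in zip(u, v)])
    return fused
```

With an all-zero `weight` and zero `bias`, the gate equals 0.5 and the result is the plain average of the two embeddings, which is a convenient sanity check.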

In the end, the paper obtains the absolute embedding \({E}^{Abs}\in {\mathbb{R}}^{\left|I\right|\times d}\). Meanwhile, the paper calculates the relative embeddings with the same approach. First, the paper calculates \(\beta\) to balance \({E}^{Sin}\) and \({E}^{Log}\). This calculation process is described in Eqs. (5) and (6):

$$\begin{array}{c}\beta = \sigma \left(\left[{\mathrm{ E}}^{\mathrm{Sin}} , {\mathrm{ E}}^{\mathrm{Log}}\right]{\mathrm{W}}_{\mathrm{sin}\_\mathrm{log}}+ {\mathrm{b}}_{\mathrm{sin}\_\mathrm{log}}\right) \end{array}$$
(5)

where \({W}_{\mathrm{sin}\_log}\in {\mathbb{R}}^{2d\times 1}\), \({b}_{\mathrm{sin}\_log} \in {\mathbb{R}}^{\left|I\right|\times \left|I\right|\times 1}\). Then, the paper combines \({E}^{Sin}\) and \({E}^{Log}\), weighted by \(\beta\), to obtain \({E}^{Sin\_Log}\):

$$s_{l\,s\ell} = \beta_{s} \otimes s_{s\ell} + (1 - \beta_{s}) \otimes l_{s\ell},\quad \ell \in \left\{1,2,\ldots,L\right\}$$
(6)

Second, the paper calculates the weight \(\upgamma\) in the same way, as shown in Eq. (7):

$$\begin{array}{c}\gamma =\sigma \left(\left[{\mathrm{E}}^{\mathrm{Sin}\_\mathrm{Log}},{\mathrm{E}}^{\mathrm{Exp}}\right]{\mathrm{W}}_{\mathrm{sin}\_\mathrm{log}\_\mathrm{exp}}+{\mathrm{b}}_{\mathrm{sin}\_\mathrm{log}\_\mathrm{exp}}\right)\end{array}$$
(7)

where \({W}_{\mathrm{sin}\_\mathrm{log}\_exp}\in {\mathbb{R}}^{2d\times 1}\), \({b}_{\mathrm{sin}\_\mathrm{log}\_exp} \in {\mathbb{R}}^{\left|I\right|\times \left|I\right|\times 1}\). Finally, the paper combines \({E}^{Sin\_Log}\) and \({E}^{Exp}\), weighted by \(\gamma\), to obtain the relative embeddings \({E}^{Rel}\in {\mathbb{R}}^{\left|I\right|\times \left|I\right|\times d}\) in Eq. (8):

$$r_{s\ell} = \gamma_{s} \otimes s_{l\,s\ell} + (1 - \gamma_{s}) \otimes e_{s\ell},\quad \ell \in \left\{1,2,\ldots,L\right\}$$
(8)
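The two-stage cascade of Eqs. (5) to (8) can be sketched as below. This is an illustration under simplifying assumptions: the biases are set to zero, each gate is a scalar per pair of positions, and the names `gate_mix` and `relative_embedding` are hypothetical.

```python
import math

def gate_mix(u, v, weight, bias=0.0):
    """Blend vectors u and v with a scalar sigmoid gate computed from [u, v]."""
    z = sum(w * x for w, x in zip(weight, list(u) + list(v))) + bias
    g = 1.0 / (1.0 + math.exp(-z))
    return [g * a + (1.0 - g) * b for a, b in zip(u, v)]

def relative_embedding(e_sin, e_log, e_exp, w_sin_log, w_sin_log_exp):
    """Cascade over L x L x d nested lists; both weights have length 2*d."""
    L = len(e_sin)
    rel = []
    for a in range(L):
        row = []
        for b in range(L):
            # Stage 1, Eqs. (5)-(6): fuse the sinusoidal and logarithmic features
            sin_log = gate_mix(e_sin[a][b], e_log[a][b], w_sin_log)
            # Stage 2, Eqs. (7)-(8): fuse the result with the exponential features
            row.append(gate_mix(sin_log, e_exp[a][b], w_sin_log_exp))
        rel.append(row)
    return rel
```

Because each stage is a convex combination, every component of the result stays within the range spanned by the three input embeddings.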

3.3 Local Learning Representation Module

Based on the SASRec model [9], the output vector \(\mathrm{x}\) of the top self-attentive block is used as the local learning representation of the user's dynamic preferences in a sequence. Because the absolute positions of the user–item interactions in a sequence are important, the paper adds the absolute position embedding matrix \({\mathrm{E}}^{\mathrm{Abs}} = \{{\mathrm{a}}_{1};{\mathrm{a}}_{2};\dots ;{\mathrm{a}}_{\mathrm{L}}\}\in {\mathbb{R}}^{\left|\mathrm{I}\right|\times \mathrm{d}}\) to the item embedding matrix \(\mathrm{E}\in {\mathbb{R}}^{|\mathrm{I}|\times \mathrm{d}}\) [33]. As a result, an input matrix \({\mathrm{X}}^{(0)}=\{{\mathrm{x}}_{1};{\mathrm{x}}_{2};\dots ;{\mathrm{x}}_{\mathrm{L}}\}\in {\mathbb{R}}^{\left|\mathrm{I}\right|\times \mathrm{d}}\) is obtained, as formalized in Eq. (9):

$$x_{\ell}^{(0)} = m_{s_{\ell}} + a_{\ell},\quad \ell \in \left\{1,2,\ldots,L\right\}$$
(9)

Afterward, the paper inputs the matrix \({\mathrm{X}}^{(0)}\) into stacked self-attentive blocks (SAB) to get the output of the \({\mathrm{b}}^{\mathrm{th}}\) block which is defined as shown in Eq. (10):

$$\begin{array}{c}{\mathrm{X}}^{\left(\mathrm{b}\right)}={\mathrm{SAB}}^{\left(\mathrm{b}\right)}\left({\mathrm{X}}^{\left(\mathrm{b}-1\right)}\right),\quad \mathrm{b}\in \left\{1,2,\dots ,\mathrm{B}\right\}\end{array}$$
(10)

Omitting the normalization layers and residual connections for brevity, each self-attentive block consists of a self-attentive layer SAL(·) whose results are fed into a feed-forward layer FFL(·), as shown in Eqs. (11)–(14):

$$\begin{array}{c}\mathrm{SAB}\left(\mathrm{X}\right)=\mathrm{FFL}\left(\mathrm{SAL}\left(\mathrm{X}\right)\right)\end{array}$$
(11)
$$\begin{array}{c}{\mathrm{X}}^{\prime}=\mathrm{SAL}\left(\mathrm{X}\right)=\mathrm{softmax}\left(\partial \otimes \frac{\mathrm{Q}{\mathrm{K}}^{\mathrm{T}}}{\sqrt{\mathrm{d}}}+\left(1-\partial \right)\otimes \frac{\mathrm{Q}{\mathrm{K}}^{\mathrm{T}}}{\sqrt{\mathrm{d}}}\otimes \frac{\mathrm{Q}{\left({\mathrm{E}}^{\mathrm{Rel}}\right)}^{\mathrm{T}}}{\sqrt{\mathrm{d}}}\right)\Delta \cdot \mathrm{V}\end{array}$$
(12)
$$\begin{array}{c}\mathrm{FFL}\left({\mathrm{X}}^{\prime}\right)=\mathrm{ReLU}\left({\mathrm{X}}^{\prime}{\mathrm{W}}_{1}+{1}^{\mathrm{T}}{\mathrm{b}}_{1}\right){\mathrm{W}}_{2}+{1}^{\mathrm{T}}{\mathrm{b}}_{2}\end{array}$$
(13)
$$\begin{array}{c}\partial =\sigma \left(\left[\frac{\mathrm{Q}{\mathrm{K}}^{\mathrm{T}}}{\sqrt{\mathrm{d}}},\frac{\mathrm{Q}{\mathrm{K}}^{\mathrm{T}}}{\sqrt{\mathrm{d}}}\otimes \frac{\mathrm{Q}{\left({\mathrm{E}}^{\mathrm{Rel}}\right)}^{\mathrm{T}}}{\sqrt{\mathrm{d}}}\right]{\mathrm{W}}_{\mathrm{abs}\_\mathrm{rel}}+{\mathrm{b}}_{\mathrm{abs}\_\mathrm{rel}}\right)\end{array}$$
(14)

Here, \(X\) is the input matrix containing absolute positional information, \(Q=X{W}_{Q}\), \(K=X{W}_{K}\), and \(V=X{W}_{V}\), in which \({W}_{Q}\), \({W}_{K}\), and \({W}_{V}\) are different weight matrices. To consider the influence of each pair of interactions in the sequence, the paper uses the relative embedding \({E}^{Rel}\). Moreover, the weight \(\partial\) is introduced to balance the effect of the absolute embeddings and the relative embeddings. In Eq. (14), \({W}_{\mathrm{abs}\_\mathrm{rel}}\in {\mathbb{R}}^{2d\times 1}\) and \({b}_{abs\_rel} \in {\mathbb{R}}^{d\times 1}\), \([.,.]\) denotes the concatenation operation, and \(\sigma\) is the sigmoid function. In this way, the paper can control the influence of the relative embeddings.

In Eq. (13), \({W}_{1}\) and \({W}_{2}\) are the weights and \({b}_{1}\) and \({b}_{2}\) are the biases of the two convolution layers, which improve the flexibility of the model. In the above equations, \(1\in {\mathbb{R}}^{d\times 1}\) is a unit vector, and \(\Delta\) is a unit lower triangular matrix used as a mask that preserves only the transformations from previous steps. Meanwhile, the paper applies layer normalization to normalize the output and dropout to avoid over-fitting, both used in the same way as in SASRec [9].
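The score mixing of Eq. (12) can be illustrated with the simplified single-sequence sketch below. It departs from the paper in several labeled ways: the gate \(\partial\) is a fixed scalar `gate` rather than the learned weight of Eq. (14), `Q`, `K`, and `V` are passed in directly instead of being projected from \(X\), the lower triangular mask is applied in the standard causal way by letting position `a` attend only to positions `b <= a`, and the feed-forward layer, residual connections, layer normalization, and dropout are omitted. The function name `causal_gated_attention` is hypothetical.

```python
import math

def causal_gated_attention(Q, K, V, E_rel, gate=0.5):
    """One simplified self-attentive layer for a single sequence.

    Q, K, V: L x d nested lists; E_rel: L x L x d nested list.
    `gate` is a fixed scalar standing in for the learned weight of Eq. (14).
    """
    L, d = len(Q), len(Q[0])
    out = []
    for a in range(L):
        scores = []
        for b in range(a + 1):  # causal mask: position a only sees b <= a
            abs_score = sum(q * k for q, k in zip(Q[a], K[b])) / math.sqrt(d)
            rel_score = sum(q * r for q, r in zip(Q[a], E_rel[a][b])) / math.sqrt(d)
            # gated mix of the absolute score and the relative-modulated score
            scores.append(gate * abs_score + (1.0 - gate) * abs_score * rel_score)
        peak = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * V[b][j] for b, w in enumerate(weights)) / total
            for j in range(d)
        ])
    return out
```

In this sketch the first position can only attend to itself, so its output is simply its own value vector; later positions blend the value vectors of all earlier steps.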

3.4 Global Representation Learning Layer

Inspired by [39], the paper proposes a novel attention module, called the location-based attention layer LBA(·), to obtain the global learning representation of all users. In particular, user u’s preference for the interaction item s at step \(l+1\) is generated as a uniform aggregation of the other interaction item representations, so the predicted rating can be considered as the factored similarity between user u’s historical items and the candidate item s, as shown in Eq. (15):

$$\tilde{y}_{l+1}^{u} = \frac{1}{\sqrt{\left|S^{u}\backslash \left\{s_{l+1}^{u}\right\}\right|}}\sum\limits_{i^{\prime}\in S^{u}\backslash \left\{s_{l+1}^{u}\right\}} m_{i^{\prime}}$$
(15)

To calculate the most representative feature representation of all items in all sequences, the paper introduces \({q}^{S}\in {\mathbb{R}}^{1\times \mathrm{d}}\) as the query vector, rather than using a uniformly weighted aggregation. \({q}^{S}\) is different from \(Q\) in Eq. (12) because this module aims to calculate all users’ global preferences instead of individual preferences.

The global representation of the sequences is formalized in Eq. (16):

$$\begin{array}{c}\mathrm{y}=\mathrm{LBA}\left(\mathrm{E}\right)=\mathrm{softmax}\left({\mathrm{q}}^{\mathrm{S}}{\left(\mathrm{E}{\mathrm{W}}_{\mathrm{K}}^{\prime}\right)}^{\mathrm{T}}\right)\mathrm{E}{\mathrm{W}}_{\mathrm{V}}^{\prime}\end{array}$$
(16)

where \(E\) is the initial input matrix and \(W{\mathrm{^{\prime}}}_{K}\), \(W{\mathrm{^{\prime}}}_{V}\) are learnable weight matrices, similar to \({W}_{Q}\), \({W}_{K}\), and \({W}_{V}\).

A dropout layer is applied during training to avoid over-fitting, and the global representation is used at every step to obtain the final global representation matrix \(Y\) of all users, as shown in Eq. (17):

$$\begin{array}{c}{\mathrm{y}}_{\mathrm{l}}=\mathrm{Dropout}\left(\mathrm{y}\right),\quad \mathrm{l}\in \left\{1,2,\dots ,\mathrm{L}\right\}\end{array}$$
(17)
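A minimal plain-Python sketch of the location-based attention in Eq. (16) is given below. It covers only the attention pooling, not the dropout of Eq. (17); the helper `matmul` and the function name `location_based_attention` are hypothetical, and the matrices are represented as nested lists.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def location_based_attention(E, q, W_k, W_v):
    """Return y = softmax(q (E W_k)^T) (E W_v) as a list of length d.

    E: L x d item embeddings; q: length-d query shared by all users;
    W_k, W_v: d x d projection matrices.
    """
    keys = matmul(E, W_k)
    values = matmul(E, W_v)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    d = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(d)]
```

Because the query `q` does not depend on the input sequence, the attention weights depend only on the keys. A zero query reduces the layer to uniform averaging, which is the baseline that the learned query vector replaces.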

3.5 Item Similarity Gating Layer

The item similarity gating module is designed for two main reasons. First, the global representation and the local representation are both important to the final recommendation, so the model must weigh the relative importance of the two. Second, in SR the user’s intent is commonly uncertain when the user sees candidate items.

Inspired by the NAIS [40], the paper designs a novel item similarity gating module which is critical in coordinating the local preference of the specific user and the global preference of all users. Meanwhile, it considers the influence of candidate items by calculating the similarity of candidate items and recently interacted items. The specific computing equation is formalized as follows in Eq. (18):

$$\begin{array}{c}g= \sigma \left(\left[{\mathrm{m}}_{\mathrm{sl}},\mathrm{y},{\mathrm{m}}_{\mathrm{i}}\right]{\mathrm{W}}_{\mathrm{G}}+{\mathrm{b}}_{\mathrm{G}}\right) \end{array}$$
(18)

where \(y\) is the global representation, \({m}_{sl}\) is the embedding vector of the most recently interacted item, and \({m}_{i}\) is the embedding vector of the candidate item. \({W}_{G}\) and \({b}_{G}\) are the weight and bias, respectively, and the sigmoid function \(\sigma\) is used as the activation function.

At step \(l\) of the sequence, the final representation \({z}_{l}\) is the weighted sum of the corresponding local learning representation \({x}_{l}\) and the global learning representation \(y\), as shown in Eq. (19):

$$\begin{array}{c}{z}_{l}={x}_{l}\otimes g+y\otimes \left(1-g\right) \end{array}$$
(19)

where \(\otimes\) denotes element-wise multiplication with broadcasting, as in TensorFlow.
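The gating and fusion steps of Eqs. (18) and (19) can be sketched as follows. This is a minimal illustration under stated assumptions: the gate is taken to be a single scalar (that is, \({W}_{G}\) is treated as a vector of length \(3d\)), which is then broadcast over the \(d\) dimensions of the representations. The function names and toy values are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def item_similarity_gate(m_sl, y, m_i, W_g, b_g):
    """Eq. (18): g = sigmoid([m_sl, y, m_i] W_G + b_G).

    m_sl, y, m_i are length-d vectors; W_g is a length-3d weight vector
    (assumption: the gate is a scalar); b_g is a scalar bias.
    """
    concat = m_sl + y + m_i  # list concatenation -> length 3d
    return sigmoid(sum(c * w for c, w in zip(concat, W_g)) + b_g)

def fuse(x_local, y_global, g):
    """Eq. (19): gated sum of the local and global representations.

    The scalar gate g is broadcast over every dimension.
    """
    return [xl * g + yg * (1.0 - g) for xl, yg in zip(x_local, y_global)]

# Toy usage (illustrative values only): zero weights give g = 0.5,
# so the fused vector is the plain average of local and global parts.
m_sl, y_vec, m_i = [1.0, 0.0], [0.0, 2.0], [1.0, 0.0]
x_local = [2.0, 0.0]
g = item_similarity_gate(m_sl, y_vec, m_i, W_g=[0.0] * 6, b_g=0.0)
z = fuse(x_local, y_vec, g)
```

A gate close to 1 lets the local representation dominate, while a gate close to 0 falls back on the global preference.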

The predicted preference score for item \(i\) being the \({(l+1)}^{\mathrm{th}}\) item in the sequence is shown in Eq. (20):

$$\begin{array}{c}{r}_{l+1,i}={z}_{l}{\left({m}_{i}\right)}^{T} \end{array}$$
(20)

MTSR-FLP is trained with the Adam optimizer, and its loss function is shown in Eq. (21):

$$\begin{array}{c}L=-\sum \limits_{\mathrm{u}\in \mathrm{U}}\sum\limits _{\mathrm{l}=1}^{\mathrm{L}-1}\updelta \left({\mathrm{s}}_{\mathrm{l}+1}^{\mathrm{u}}\right)\left[\mathrm{log}\left(\upsigma \left({\mathrm{r}}_{\mathrm{l}+1,{\mathrm{s}}_{\mathrm{l}+1}^{\mathrm{u}}}^{\mathrm{u}}\right)\right)+\mathrm{log}\left(1-\upsigma \left({\mathrm{r}}_{\mathrm{l}+1,\mathrm{j}}^{\mathrm{u}}\right)\right)\right] \end{array}$$
(21)

in which \({\mathrm{r}}_{\mathrm{l}+1,\mathrm{j}}^{\mathrm{u}}\) is the predicted score of the negative sample \(j\) as the \({(\mathrm{l}+1)}^{\mathrm{th}}\) item in the sequence, and \(\updelta \left({\mathrm{s}}_{\mathrm{l}+1}^{\mathrm{u}}\right)\) eliminates the influence of padding items: if \({\mathrm{s}}_{\mathrm{l}+1}^{\mathrm{u}}\) is a padding item, \(\updelta \left({\mathrm{s}}_{\mathrm{l}+1}^{\mathrm{u}}\right)\) is 0; otherwise, it is 1.
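For a single user, the loss in Eq. (21) can be sketched in a few lines of Python. The sketch assumes that padding items are encoded with the ID 0 and that one negative sample is drawn per step; both are illustrative assumptions, and the function names are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sequence_bce_loss(pos_scores, neg_scores, targets, pad_id=0):
    """Eq. (21) for one user.

    pos_scores[l] is the score of the observed next item at step l,
    neg_scores[l] is the score of one sampled negative item, and
    targets[l] is the ID of the observed next item (pad_id marks padding).
    """
    loss = 0.0
    for r_pos, r_neg, target in zip(pos_scores, neg_scores, targets):
        delta = 0.0 if target == pad_id else 1.0  # mask out padding steps
        # log(1 - sigmoid(r)) equals log(sigmoid(-r)).
        loss -= delta * (log_sigmoid(r_pos) + log_sigmoid(-r_neg))
    return loss

# Toy usage: the second step is padding and therefore contributes nothing.
loss = sequence_bce_loss(pos_scores=[0.0, 5.0], neg_scores=[0.0, 5.0], targets=[3, 0])
```

Summing this quantity over all users gives the total loss that the optimizer minimizes.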

4 Experimental Results and Analysis

4.1 Experimental Environment

The experimental environment of the proposed model is based on Python 3.6 or above and torch 1.10.0 or above, running under Anaconda 3-2020.02 or above. The main packages include cuda 10.2, cudnn10.2, torch==1.10.0+cu102, networkx==2.5.1, numpy==1.19.2, pandas==1.1.5, six==1.16.0, scikit-learn==0.24.2, and space==3.4.0. The hardware consists of an Intel Xeon Gold 5218 processor, an RTX 3060 GPU, and 32 GB of DDR4 memory.

This paper verifies the performance of the proposed model under strong generalization conditions by dividing all users into training and testing sets in a ratio of eight to two. Within each user's historical sequence, the last item is used for testing and the remaining items are used for training. During training, the model is evaluated with the two commonly used metrics described in Sect. 4.2. The experiments are conducted on five real-world datasets: Beauty [41], Games [42], Steam [43], Foursquare [44], and Tmall [45]. Their statistics are shown in Table 2. The optimal values of the nine parameters listed in Table 3 were determined through experimental comparison.
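The splitting protocol described above can be sketched as follows. The function names, the fixed random seed, and the way the two steps are combined are assumptions made for illustration; the paper does not specify these details.

```python
import random

def split_users(user_ids, train_ratio=0.8, seed=42):
    """Split user IDs into disjoint training and testing groups (8:2 by default)."""
    ids = sorted(user_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

def hold_out_last(sequence):
    """Return (history, target): all items but the last, and the last item."""
    return sequence[:-1], sequence[-1]

# Toy usage with ten hypothetical users and one interaction sequence.
train_users, test_users = split_users(range(10))
history, target = hold_out_last([11, 42, 7, 93])
```

Under this reading, the held-out last item of each sequence serves as the prediction target, and the remaining items form the model input.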

Table 2 Statistics of the processed datasets
Table 3 Parameter descriptions and optimal settings

4.2 Evaluation Indicators

4.2.1 Hit Rate

This paper uses the hit rate (HR@N) to measure the fraction of users whose top-\(N\) recommendation list contains at least one correct item. The metric is calculated as shown in Eq. (22):

$$\begin{array}{c}HR@N=\frac{1}{\left|\mathrm{U}\right|}\sum \limits_{\mathrm{u}\in \mathrm{U}}\updelta \left(\left|{\widehat{\mathrm{I}}}_{\mathrm{u},\mathrm{N}}\cap {\mathrm{T}}_{\mathrm{u}}\right|>0\right) \end{array}$$
(22)

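in which \({\widehat{\mathrm{I}}}_{\mathrm{u},\mathrm{N}}\) is the set of the top-\(N\) items recommended to user \(u\), \({\mathrm{T}}_{\mathrm{u}}\) is the set of test items of user \(u\), and \(\updelta \left(\cdot \right)\) is the indicator function, which equals 1 if \(\left|{\widehat{\mathrm{I}}}_{\mathrm{u},\mathrm{N}}\cap {\mathrm{T}}_{\mathrm{u}}\right|>0\) and 0 otherwise.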
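A minimal sketch of Eq. (22) is given below. The dictionary-based data layout and the function name are illustrative assumptions.

```python
def hit_rate_at_n(recommended, ground_truth, n=10):
    """Eq. (22): fraction of users whose top-n list contains a test item.

    recommended maps each user to a ranked list of item IDs;
    ground_truth maps each user to the list of held-out test items.
    """
    users = list(ground_truth)
    hits = sum(
        1 for u in users
        if set(recommended[u][:n]) & set(ground_truth[u])  # non-empty intersection
    )
    return hits / len(users)

# Toy usage: user u1 is hit at rank 3, user u2 is never hit.
recommended = {"u1": [1, 2, 3], "u2": [4, 5, 6]}
ground_truth = {"u1": [3], "u2": [9]}
hr_at_3 = hit_rate_at_n(recommended, ground_truth, n=3)
hr_at_2 = hit_rate_at_n(recommended, ground_truth, n=2)
```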

4.2.2 Normalized Discounted Cumulative Gain

The paper also adopts the normalized discounted cumulative gain (NDCG@N, with \(N=10\) in the experiments) because it takes into account the rank position of correctly recommended items. It is calculated as shown in Eq. (23):

$$\begin{array}{c}{\text{NDCG@N}}=\frac{1}{\mathrm{Z}}DCG@N=\frac{1}{\mathrm{Z}}\frac{1}{\left|\mathrm{U}\right|}\sum \limits_{\mathrm{u}\in \mathrm{U}}\sum \limits_{\mathrm{k}=1}^{\mathrm{N}}\frac{\updelta \left({\widehat{\mathrm{i}}}_{\mathrm{u,k}}\in {\mathrm{I_{u}}}\right)}{{\mathrm{log}}_{2}\left(\mathrm{k}+1\right)} \end{array}$$
(23)

where \({\widehat{\mathrm{i}}}_{\mathrm{u},\mathrm{k}}\) denotes the \({\mathrm{k}}^{\mathrm{th}}\) item recommended to user \(u\), and \(Z\) is a normalization constant equal to the discounted cumulative gain (DCG@N) of the ideal case, which is the maximum possible value of DCG@N.
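The following sketch illustrates Eq. (23). It normalizes per user by the ideal DCG, which is a common variant and an assumption here; when each user has a single held-out item, the ideal DCG is 1 and this variant coincides with a global constant \(Z=1\). The function names and data layout are illustrative.

```python
import math

def dcg_at_n(ranked, relevant, n):
    """Discounted cumulative gain of the top-n ranked items (binary relevance)."""
    return sum(
        1.0 / math.log2(k + 1)
        for k, item in enumerate(ranked[:n], start=1)
        if item in relevant
    )

def ndcg_at_n(recommended, ground_truth, n=10):
    """Average of DCG@n divided by the ideal DCG@n over all users."""
    total = 0.0
    for user, test_items in ground_truth.items():
        relevant = set(test_items)
        # Ideal case: all relevant items occupy the top ranks.
        ideal = sum(1.0 / math.log2(k + 1) for k in range(1, min(len(relevant), n) + 1))
        if ideal > 0:
            total += dcg_at_n(recommended[user], relevant, n) / ideal
    return total / len(ground_truth)

# Toy usage: u1 is hit at rank 1 (gain 1.0), u2 at rank 3 (gain 1 / log2(4) = 0.5).
recommended = {"u1": [7, 8, 9], "u2": [1, 2, 3]}
ground_truth = {"u1": [7], "u2": [3]}
score = ndcg_at_n(recommended, ground_truth, n=10)
```

Unlike the hit rate, the score drops when the correct item moves down the list, which is why both metrics are reported together.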

4.3 Baseline Model

The proposed MTSR-FLP model is compared with the following baseline models:

  • BPRMF [46]: a standard recommendation model based on MF via pairwise loss.

  • FISM [47]: another popular model learning the similarity matrix between items.

  • GRU4Rec [1]: one of the earliest deep learning-based RNN models for sequence recommendation, improved via an additional sampling strategy and twice loss function.

  • SASRec [9]: a popular sequential recommendation model using a self-attention network.

4.4 Experimental Results

As shown in Table 4, the MTSR-FLP model achieves better results in both HR@10 and NDCG@10 compared with other sequence recommendation models. The performance comparison of MTSR-FLP and SASRec is shown in Figs. 3 and 4.

Table 4 HR@10 and NDCG@10 performance comparison
Fig. 3

Performance of the SASRec and MTSR-FLP models on HR@10 across the five datasets. The blue dotted line represents MTSR-FLP and the red dotted line represents SASRec. Under the evaluation index of HR@10, MTSR-FLP performs better than the SASRec model in the five datasets

Fig. 4

Performance of the SASRec and MTSR-FLP models on NDCG@10 across the five datasets. Under the evaluation index of NDCG@10, MTSR-FLP performs better than the SASRec model in the five datasets

Figs. 3 and 4 show that MTSR-FLP generally performs much better than SASRec across the five datasets. On the Foursquare dataset, however, SASRec performs better than MTSR-FLP. This may be because the results are affected by over-fitting as the training time increases.

4.5 Ablation Study

As shown in Table 5, Model-1 is the proposed model, while the other models differ in their attention factors, including the absolute embeddings, the relative embeddings, and the calculating approaches. The performance of the models with different settings is reported in Table 6.

Table 5 MTSR-FLP models with different settings
Table 6 Performance of models with different settings

In Model-2, the paper defines \({E}^{Pos}\) as the absolute embedding and \({E}^{Exp}\) as the relative embedding. Since Model-2 has only one absolute embedding and one relative embedding, \({E}^{Abs}\) is the same as \({E}^{Pos}\) and \({E}^{Rel}\) is the same as \({E}^{Exp}\). In other words, the absolute embedding calculation block and the relative embedding calculation block are not needed. Therefore, Eq. (12) is replaced by Eq. (24):

$$\begin{array}{c}{X}^{\prime}=\mathrm{SAL}\left(X\right)=\mathrm{softmax}\left(\frac{Q{K}^{T}}{\sqrt{d}}\otimes \frac{Q{\left({E}^{Rel}\right)}^{T}}{\sqrt{d}}\right)\Delta \cdot V \end{array}$$
(24)

Compared with Eq. (12), there is no parameter \(\partial\), which means that Model-2 does not consider the negative influence of relative embeddings. Model-3 is the same as Model-2 except that \({E}^{Exp}\) is replaced with \({E}^{Sin}\).

In Model-4, the paper uses \({E}^{Pos}\) as the absolute embedding and \({E}^{Exp}\) and \({E}^{Log}\) as the relative embeddings. Since \({E}^{Abs}\) is the same as \({E}^{Pos}\), the absolute embedding calculation block is not needed. The relative embedding calculation block differs from that of the proposed model. First, the paper concatenates \({E}^{Sin}\) and \({E}^{Log}\) to get \({E}^{Rel}\). Then, the paper feeds \({E}^{Rel}\) into a feed-forward neural network to get the final \({E}^{Rel}\). Model-4 does not consider the negative influence of the relative embeddings. This is formalized in Eq. (25):

$$\begin{array}{c}{E}^{\mathrm{Rel}}=\mathrm{FFL}\left(\left[{E}^{\mathrm{Sin}},{E}^{\mathrm{Log}}\right]\right) \end{array}$$
(25)

The performance of the different model settings is shown in Table 6. To highlight the training performance, the paper depicts the results in Fig. 5. As Fig. 5 shows, the proposed model performs best on almost all datasets in terms of both HR@10 and NDCG@10.

Fig. 5

The performance of MTSR-FLP with different settings. Model-1 performs very well in both HR@10 and NDCG@10 and outperforms the other models on all datasets except Tmall

The training process is shown in Figs. 6 and 7. The proposed model performs better than the models with other settings on almost all datasets. The ablation results are analyzed as follows.

  • Model-1 vs. Model-2 and Model-3. During training, Model-1 performs better than Model-2 and Model-3 on most datasets. The reason is that Model-2 uses only \({E}^{Exp}\) as the relative embedding and Model-3 uses only \({E}^{Sin}\), whereas the proposed model uses \({E}^{Exp}\), \({E}^{Log}\), and \({E}^{Sin}\) as relative embeddings. Therefore, the proposed model uses more information than Model-2 and Model-3.

  • Model-2 vs. Model-3. The relative performance of Model-2 and Model-3 varies across datasets. This is mainly because \({E}^{Exp}\) and \({E}^{Sin}\) capture different time information. With \({E}^{Exp}\) as the relative embedding, the contribution of a large time gap quickly decays to zero. With \({E}^{Sin}\) as the relative embedding, only periodic occurrences are captured. Since the time information differs between datasets, so does the performance.

  • Model-4 vs. Model-2 and Model-3. Model-4 performs worse than Model-2 and Model-3. Although Model-4 uses more information than Model-2 and Model-3, combining \({E}^{Exp}\) and \({E}^{Log}\) with a feed-forward network causes over-fitting.

  • Model-1 vs. Model-2, Model-3, and Model-4. Table 6 shows that Model-2, Model-3, and Model-4 perform worse than the proposed model on the Foursquare dataset. These models do not consider the negative influence of the relative embedding, whereas relative time information is not useful in this dataset.
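The contrast between the exponential and sinusoidal encodings discussed in the list above can be made concrete with a toy sketch. The functional forms, the scale `tau`, and the period of 7 are hypothetical choices for illustration only; the exact definitions of \({E}^{Exp}\) and \({E}^{Sin}\) used by the model are given earlier in the paper and may differ.

```python
import math

def exp_encoding(gap, tau=1.0):
    """Hypothetical exponential encoding of a time gap: exp(-gap / tau)."""
    return math.exp(-gap / tau)

def sin_encoding(gap, period=7.0):
    """Hypothetical sinusoidal encoding of a time gap with a fixed period."""
    return math.sin(2.0 * math.pi * gap / period)

# A large gap is mapped to (almost) zero by the exponential encoding ...
far = exp_encoding(50.0)
# ... while the sinusoidal encoding cannot tell gaps one period apart.
same_phase = (sin_encoding(3.0), sin_encoding(10.0))
```

In this toy setting the exponential form discards long gaps and the sinusoidal form keeps only the phase, which matches the observation that each encoding suits datasets with different temporal patterns.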

Fig. 6

The performance in different epochs of HR@10 for MTSR-FLP models with different settings. Model-2 performed well on the first dataset. Model-1 outperformed the other models on the remaining four datasets

Fig. 7

The performance in different epochs of NDCG@10 for MTSR-FLP models with different settings

5 Conclusion

This paper proposes a novel MTSR-FLP model for sequence recommendation. The model integrates attention mechanisms, including a multi-temporal embedding self-attention network and a factorial item similarity module, to improve the performance of SR. In particular, the multiple temporal embedding provides absolute and relative embeddings for the self-attention network using temporal information and multiple encoding functions, where different attention heads handle different temporal embeddings. Meanwhile, this paper also designs a gating module that modulates the local and global representations by considering the relationship between candidate items, recently interacted items, and all users' global preferences, in order to cope with possible uncertainty in user intent.

There are several directions for improving the proposed model. First, although this paper has explored deep learning and self-attention network models, more strategies might be worth trying to improve the performance of SR. In the near future, the authors will try integrating further techniques, such as GNNs and generative adversarial networks (GANs), into the MTSR-FLP model. Second, future work should capture multi-dimensional information about items and users in SR, which can be used to further classify the recommended items and the users. More detailed time feature information also needs to be obtained to improve the accuracy of SR. Finally, new datasets will be used to evaluate and improve the performance of the proposed model.