Self-supervised global context graph neural network for session-based recommendation

Session-based recommendation (SBR) aims to recommend the next items based on anonymous behavior sequences over a short period of time. Compared with other recommendation paradigms, the information available in SBR is very limited; therefore, capturing item relations across sessions is crucial. Recently, many methods have been proposed to learn item transition relationships over all sessions. Despite their success, these methods may enlarge the impact of noisy interactions and ignore the complex high-order relationships between non-adjacent items. In this study, we propose a self-supervised global context graph neural network (SGC-GNN) to model high-order transition relations between items over all sessions by using virtual context vectors, each of which connects to all items in a given session and enables the model to collect and propagate information beyond adjacent items. Moreover, to improve the robustness of the proposed model, we devise a contrastive self-supervised learning (SSL) module as an auxiliary task, jointly learning more robust item representations and training the model for the SBR task. Experimental results on three benchmark datasets demonstrate the superiority of our model over state-of-the-art (SOTA) methods and validate the effectiveness of the context vectors and the self-supervised module.


INTRODUCTION
In the era of information explosion, recommendation systems (RS) play critical roles in various online commercial applications due to their success in addressing information overload by recommending useful content to users. Many existing recommendation approaches rely on user profiles and long-term historical interactions to predict user preferences, e.g., collaborative filtering (Sarwar et al., 2001), matrix factorization (Rendle et al., 2009), and deep learning based methods (He et al., 2017). However, in many real-world scenarios, such information may not exist. Consequently, session-based recommendation (SBR), which aims to predict the next item of interest based on a given anonymous behavior sequence within a short period of time, has recently attracted more and more attention.
Early methods (Zimdars, Chickering & Meek, 2001) used Markov chains to predict the next item based on previously clicked items, but they have limited prediction accuracy due to the strong independence assumption. The main contributions of this work are summarized as follows.
(1) To the best of our knowledge, this is the first work that adds context vectors to a global session graph to learn the relationships between item pairs that are not directly connected across sessions. (2) A novel self-supervised module is proposed to obtain more robust item representations in the global context graph. (3) A unified scheme combines the pairwise item-transition information in the OSG with the high-order relationships between adjacent and non-adjacent items in the GCSG. (4) Extensive experiments show that SGC-GNN outperforms the SOTA baselines and achieves significant improvements on three real-world datasets.

RELATED WORK
Session-based recommendation. SBR aims to capture dynamic user preferences to provide more timely and accurate recommendations. Early SBR mainly used Markov decision process-based methods to capture the sequential signals in interactions. Wu et al. (2013) proposed a Personalized Markov Embedding (PME) model to embed both users and items into a Euclidean space in which the distance between users and items reflects the strength of their relationships. Le, Fang & Lauw (2016) developed a hidden Markov model incorporating dynamic user-biased emission and context-biased transition for recommendation. With the development of deep learning, many methods take advantage of the powerful capabilities of deep neural networks to model the complex dependencies in interactions. Wang et al. (2020a) proposed a method that incorporates dwell time in SBR and uses RNNs to predict the next item. Tan, Xu & Liu (2016) applied data augmentation techniques and considered temporal shifts in user behavior to improve the performance of SBR. Song, Elkahky & He (2016) employed an MLP layer to combine both long-term static and short-term temporal user preferences and trained the model with a pre-training method.
In recent years, GNNs have developed rapidly and are widely used in various fields (Zhou et al., 2020a), such as providing optimal bike station layouts in decision support systems and predicting traffic states (Zheng et al., 2020). In addition, some methods employ GNNs to model the complex transitions within or between sessions and have shown promising results in session-based recommendation. Wu et al. (2019) first introduced GNNs into SBR and achieved superior performance. Chen & Wong (2020) proposed a lossless encoding scheme to address the lossy session encoding problem and devised a shortcut graph attention layer to capture long-range dependencies. Qiu et al. (2019) proposed a weighted attention graph layer to learn the embeddings of items and sessions for next-item recommendation. Wang et al. (2022) simulated users' behavior patterns in the session without destroying the click order and highlighted the critical preferences of users during the simulation process. Xu et al. (2019) dynamically constructed a graph structure for session sequences and used a self-attention network and a GNN to capture global dependencies and local short-term dependencies, respectively. Huang et al. (2021) developed a position-aware attention mechanism to learn item transitional regularities within individual sessions and proposed a graph-structured hierarchical relation encoder to capture cross-session item transitions explicitly. Deng et al. (2022) decomposed the session-based recommendation workflow into two steps: they built a global graph over all session data, learned global item representations in an unsupervised manner, and later refined these representations in individual session graphs. Qiu et al. (2020) constructed a broadly connected session graph to exploit and incorporate cross-session information in the individual session's representation learning.
Although these studies demonstrate that GNN-based methods achieve good performance, they construct graphs based only on the adjacency or sequential relationships of items, making it difficult to effectively model complex high-order relationships between items in different sessions. Xia et al. (2021b) constructed a hypergraph to capture the high-order correlations among items, which is similar in spirit to our approach. However, it ignores the critical sequential relationships of items within a session and introduces a lot of noise, which reduces the robustness of the model. Besides, Pan et al. (2020) added a star node into a session graph to consider non-adjacent items, which inspired us to build a node representing each session on the cross-session graph to learn high-order relationships beyond adjacent items in different sessions.

Self-supervised learning
Self-supervised learning, especially contrastive learning (Hjelm et al., 2019), is designed to learn data representations from raw data without manual annotations and can learn user representations more robustly. Yao et al. (2021) proposed a multi-task SSL framework for large-scale item recommendations and devised a data augmentation method from the perspective of feature correlations. Wu et al. (2021) employed three types of data augmentation from different aspects and took node self-discrimination as the self-supervised task to offer an auxiliary signal for representation learning. Xia et al. (2021a) learned session representations from the session view and the item view with a self-supervised graph co-training framework, which iteratively selects evolving pseudo-labels as informative self-supervision examples for each view to improve recommendation performance.

Problem statement
Let $V = \{v_1, v_2, \ldots, v_{|V|}\}$ denote the set of all unique items involved in all sessions, where $|V|$ is the number of unique items. A session $S$ can be represented by an item list $S = [v^s_1, v^s_2, \ldots, v^s_m]$ ordered by timestamps, where $v^s_i \in V$ represents the $i$-th clicked item within the session $S$. The goal of SBR is to predict the next clicked item $v^s_{m+1}$ for the session $S$ given $v^s_1, v^s_2, \ldots, v^s_m$.

Ongoing session graph module
The ongoing session graph (OSG) module aims to learn personalized item embeddings by modeling sequential patterns in the current session. First, each session sequence $S$ is modeled as a directed graph $G_s = (V_s, E_s)$, which we call the OSG. Concretely, each node represents an item $v^s_i \in V$, and each edge $(v^s_{i-1}, v^s_i) \in E_s$ means that a user clicked item $v^s_i$ after $v^s_{i-1}$ in the session $S$. The transition relationships between items in the session can be represented by an incoming matrix and an outgoing matrix. For example, given a session $S = [v_1, v_2, v_3, v_2, v_8]$ with OSG $G_s$, its incoming matrix $M_{s,I}$ and outgoing matrix $M_{s,O}$ are shown in Fig. 2. We concatenate the incoming and outgoing matrices to obtain a matrix $M = \mathrm{Concat}(M_I, M_O)$ for each session, which describes how nodes in the OSG communicate with each other. We embed each item $v \in V$ into a unified embedding space, where the node vector $\mathbf{v} \in \mathbb{R}^d$ denotes the latent vector of item $v$. Concretely, we generate a $d$-dimensional embedding $\mathbf{v}_i$ for each unique item $v_i$ in the session through an embedding layer. Then, we use gated graph neural networks (GGNN) (Wu et al., 2019) to update each item embedding in the OSG, where the adjacency matrix $M_s$ and the layer $l-1$ embeddings are used to update the embedding $\mathbf{v}^{s,l}_i$ of node $v_i$ in $G_s$ at layer $l$ as follows.
where $W, W_z, W_r, W_o \in \mathbb{R}^{d \times 2d}$ and $U_z, U_r, U_o \in \mathbb{R}^{d \times d}$ are weight matrices, and $b \in \mathbb{R}^d$ is a bias vector. $z^s_i$ and $r^s_i$ are the update and reset gates, which decide what information to preserve and discard, respectively. $\sigma(\cdot)$ is the sigmoid function, and $\odot$ is the element-wise multiplication operator. $\tilde{\mathbf{v}}^s_i \in \mathbb{R}^d$ represents the candidate state of node $v_i$, and the final state $\mathbf{v}^s_i$ is the combination of the previous hidden state and the candidate state under the control of the update gate. However, a GNN with multiple layers is prone to overfitting and over-smoothing. We apply dropout (Srivastava et al., 2014) at each layer and a highway network (Pan et al., 2020) to alleviate these problems. Concretely, we aggregate the output of the last layer of the module with the initial input to form the final item representation as follows.
where $W_s \in \mathbb{R}^{2d \times d}$ is a learnable parameter matrix and $[\cdot\,;\cdot]$ is the concatenation operation.
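To make the OSG construction and the gated update concrete, the following minimal NumPy sketch builds the normalized incoming/outgoing matrices for the example session above and applies one GGNN-style update. It is illustrative only: the function names, exact message form, and parameter shapes are our assumptions, not the paper's released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_session_matrices(session):
    """Build the normalized incoming/outgoing matrices of an OSG from clicks."""
    nodes = list(dict.fromkeys(session))            # unique items, click order kept
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    adj = np.zeros((n, n))
    for a, b in zip(session, session[1:]):          # consecutive clicks -> edge
        adj[idx[a], idx[b]] = 1.0
    out_deg = np.maximum(adj.sum(axis=1, keepdims=True), 1)
    in_deg = np.maximum(adj.sum(axis=0, keepdims=True), 1)
    m_out = adj / out_deg                           # row-normalized outgoing
    m_in = (adj / in_deg).T                         # normalized incoming
    return nodes, np.concatenate([m_in, m_out], axis=1)   # M = Concat(M_I, M_O)

def ggnn_layer(V, M, p):
    """One gated update of node embeddings (a sketch of Eq. (1)).
    V: (n, d) node embeddings; M: (n, 2n) concatenated [M_in ; M_out]."""
    n, d = V.shape
    m_in, m_out = M[:, :n], M[:, n:]
    a = np.concatenate([m_in @ V, m_out @ V], axis=1)      # (n, 2d) messages
    z = sigmoid(a @ p["Wz"].T + V @ p["Uz"].T)             # update gate
    r = sigmoid(a @ p["Wr"].T + V @ p["Ur"].T)             # reset gate
    cand = np.tanh(a @ p["Wo"].T + (r * V) @ p["Uo"].T)    # candidate state
    return (1 - z) * V + z * cand                          # gated combination
```

For the example session [v1, v2, v3, v2, v8], node v2 has two outgoing edges (to v3 and to v8), so its outgoing-matrix row holds two entries of 0.5.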

Global context session graph module
The GCSG module aims to learn more powerful item embeddings by modeling complex high-order relationships among items through context vectors over different sessions. First, we formulate a cross-session item graph $G = (V_g, E_g)$, in which the nodes $V_g$ and edges $E_g$ are generated from historical sessions. Each session sequence $S$ is viewed as a path that starts from $v^s_1$ and ends at $v^s_m$ in the graph $G$. Unlike existing methods, we add a global representation for every session in the graph $G$, called a master node or a context vector (Gilmer et al., 2017; Battaglia et al., 2018). The context vector builds up a representation of the session as a whole and has a bidirectional edge to all other nodes in the session, providing a natural way to pass information between items that are not directly connected. We call the modified graph the GCSG (global context session graph) and formulate it as $G_g = (V_g, E_g, C_g)$, where $C_g$ denotes the context vectors. A simple illustration of the GCSG is shown in Fig. 3.

Initialization
First, we initialize each item $v \in V$ in a unified embedding space, yielding a representation $\mathbf{v} \in \mathbb{R}^d$ as in the OSG module. To incorporate sequential information into a context vector, we also add a learnable position embedding $\mathbf{p} \in \mathbb{R}^{d \times m}$ to the item representations. We then take the representation of the last item $\mathbf{v}_m$ as the local embedding of the session $S$, i.e., $\mathbf{s}_l = \mathbf{v}_m$. After this, we aggregate all node vectors of the session into the global preference embedding $\mathbf{s}_g$, adopting a soft-attention mechanism to learn their priorities. We then hybridize the local and global embeddings $\mathbf{s}_l$ and $\mathbf{s}_g$ as below.
where $W_0 \in \mathbb{R}^d$, $W_1, W_2 \in \mathbb{R}^{d \times d}$, and $W_3 \in \mathbb{R}^{d \times 2d}$ are learnable parameters controlling the weights of items, and $b \in \mathbb{R}^d$ is a bias vector. Finally, we use the hybrid embedding $\mathbf{s}_h$ as the initialization of the corresponding context vector $\mathbf{c}_s$, i.e., $\mathbf{c}_s = \mathbf{s}_h$. This strategy combines the long-term preference and the recent interests of the session, building up a good representation of the session as a whole.
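The soft-attention initialization above can be sketched as follows. This is a simplified NumPy illustration under our own assumptions about the attention form (last click queries all items, sigmoid scoring, then a linear hybrid); the paper's exact Eq. (3) may differ in detail.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_context_vector(V, W0, W1, W2, W3, b):
    """Soft-attention initialization of a session's context vector (sketch).
    V: (m, d) item embeddings of the session, ordered by click time."""
    s_l = V[-1]                                       # local embedding: last click
    alpha = sigmoid(s_l @ W1.T + V @ W2.T + b) @ W0   # per-item priorities, (m,)
    s_g = alpha @ V                                   # global preference embedding
    return np.concatenate([s_l, s_g]) @ W3.T          # hybrid s_h = W3 [s_l ; s_g]
```

The result serves as the initial context vector for the session, mixing the most recent interest (last click) with the attention-weighted long-term preference.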

Node updating
To learn high-order item transition information from sessions, inspired by Pan et al. (2020), we alternately update the item embeddings and the context vectors on the global context session graph $G_g$. For each node in $G_g$, information is collected and propagated from two sources: adjacent items and context vectors. First, we handle the graph $G_g$ without considering context vectors in the same way we handle the OSG: the construction of the incoming and outgoing matrices of $G_g$ is similar to the OSG, and we again concatenate the two matrices to obtain the matrix $M$. For each node $v_i$ in $G_g$ at layer $l$, we update the node representation $\mathbf{v}^{g,l}_i$ from adjacent nodes across different sessions according to Eq. (1), followed by a dropout layer to alleviate over-fitting. (Figure 3 shows an example of a GCSG consisting of three sessions, where $c_1, c_2, c_3$ are the context vectors corresponding to sessions $s_1, s_2, s_3$, respectively.) Since items may appear in multiple sessions, each node in $G_g$ may be connected to multiple context vectors. Suppose the context vectors of the sessions containing node $v_i$ form the set $\mathbf{c}_i = [\mathbf{c}_{i,1}, \mathbf{c}_{i,2}, \ldots, \mathbf{c}_{i,n}]$, where $n$ is the number of sessions containing $v_i$. We first calculate the similarity $\alpha^l_{i,j}$ of node $v_i$ and context vector $\mathbf{c}_{i,j}$ at layer $l$ with an attention mechanism as below.
where $W_{q1}, W_{k1} \in \mathbb{R}^{d \times d}$ are trainable parameters, $\sqrt{d}$ scales the coefficients, and $\mathbf{v}^l_i$ and $\mathbf{c}^{l-1}_{i,j}$ are the representation of node $v_i$ at layer $l$ and the context vector representation at layer $l-1$, respectively. We then obtain the representation of node $v_i$ from the context vectors as a linear combination of the $\mathbf{c}^l_j$ with the similarities $\alpha^l_{i,j}$ as weights ($j = 1, \ldots, n$). After this, we calculate the level priority $\beta^l_i$ by performing a nonlinear map on the representation vectors $\mathbf{v}^{g,l}_i$ and $\mathbf{v}^{c,l}_i$ to balance their importance.
where $\mathbf{v}^{g,l}_i$ is obtained from adjacent items, $\mathbf{v}^{c,l}_i$ is obtained from the context vectors at layer $l$, and $W_4 \in \mathbb{R}^{2d \times d}$ is a learnable parameter matrix. Then, applying a gate mechanism, we integrate the information from adjacent nodes and the related context vectors as follows.
where $\mathbf{v}^l_i$ is the representation of the node at layer $l$. Finally, we aggregate the output of the last layer and the initial input of the module, similar to Eq. (2), to obtain the final item representation $\mathbf{v}^g_i$.
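The attention over context vectors and the subsequent gated fusion can be sketched as below. This is hypothetical NumPy code: the exact attention parameterization and gate form are assumptions based on the description, not the released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_node_with_contexts(v_g, C, Wq, Wk, W4):
    """Node-updating sketch: scaled-dot attention over the context vectors of
    the sessions containing the node, then a gated fusion of the two sources.
    v_g: (d,) node embedding from adjacent items; C: (n_ctx, d) context vectors."""
    d = v_g.shape[0]
    logits = (C @ Wk.T) @ (Wq @ v_g) / np.sqrt(d)   # similarity to each context
    alpha = np.exp(logits - logits.max())
    alpha = alpha / alpha.sum()                     # softmax attention weights
    v_c = alpha @ C                                 # info gathered from contexts
    beta = sigmoid(np.concatenate([v_g, v_c]) @ W4) # level-priority gate, (d,)
    return beta * v_g + (1 - beta) * v_c            # gated integration
```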

Context vector updating
For each context vector in the graph $G_g$, we use only the representations of the items in its corresponding session to update it. First, we assign a different degree of importance to each node $v_i$ as below.
where $W_{q2}, W_{k2} \in \mathbb{R}^{d \times d}$ are trainable parameters and $\alpha^l_{j,i}$ denotes the importance of the $i$-th item to the $j$-th session at layer $l$. We then perform a linear combination of the item representations and aggregate the updated context vector $\mathbf{c}^l_j$ with its layer $l-1$ representation $\mathbf{c}^{l-1}_j$ as follows.
where $W_5 \in \mathbb{R}^{d \times 2d}$ is a learnable parameter matrix. As in the final step of node updating, we also use a highway network to combine the module's initialization of the context vector with the output of the last layer to obtain the final representation $\mathbf{c}_j$.
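Mirroring the node-updating step, the context-vector update can be sketched as follows (again a hypothetical NumPy illustration; the gate and attention forms are our assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_context_vector(c_prev, V_sess, Wq, Wk, W5):
    """Context-vector updating sketch: attend over the session's own items,
    then combine the result with the previous-layer context via a learned gate.
    c_prev: (d,) previous context; V_sess: (m, d) the session's item embeddings."""
    d = c_prev.shape[0]
    logits = (V_sess @ Wk.T) @ (Wq @ c_prev) / np.sqrt(d)  # item importances
    w = np.exp(logits - logits.max())
    w = w / w.sum()                                 # softmax over the session
    c_new = w @ V_sess                              # aggregated item information
    g = sigmoid(np.concatenate([c_new, c_prev]) @ W5.T)    # gate, W5 in R^{d x 2d}
    return g * c_new + (1 - g) * c_prev             # highway-style combination
```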

Self-supervised contrastive learning module
To improve the robustness of the model, we integrate self-supervised contrastive learning into the GCSG module. Since data augmentation methods are not the main concern of this study, we simply use an edge-drop strategy to obtain an augmented view of the GCSG. Given a mini-batch of sessions $\{s_u\}_{u=1}^N$, we apply edge drop on the GCSG $G_g$ to obtain an augmented graph $G_g^{aug}$. We view the same session in the original GCSG and the augmented graph $(s_n, s_n^{aug})$ as a positive pair, and the other $2(N-1)$ sessions in the two graphs are treated as negative samples. For each session pair $(s_n, s_n^{aug})$, the updated context vectors are $(\mathbf{c}_n, \mathbf{c}_n^{aug})$. Since a context vector can be viewed as an overall representation of its session, the updated context vectors obtained from a session pair can naturally be treated as a pair of positive samples. We adopt the InfoNCE loss (van den Oord, Li & Vinyals, 2019) over the context vectors in the two graphs as the optimization objective defined below.
$$\mathcal{L}_{ssl}(\mathbf{c}_n, \mathbf{c}_n^{aug}) = -\log \frac{\exp(\mathrm{sim}(\mathbf{c}_n, \mathbf{c}_n^{aug})/\tau)}{\sum_{m=1}^{2N} \exp(\mathrm{sim}(\mathbf{c}_n, \mathbf{c}_m^{aug})/\tau)}$$
where $\mathrm{sim}(\cdot,\cdot)$ is a similarity function, e.g., the dot product, and $\tau$ is a hyper-parameter that controls the scaling. We apply the self-supervised contrastive learning (SSL) loss at the context-vector level instead of the item level to strengthen the robustness of the whole model.
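The augmentation and the loss can be sketched as follows. This is a simplified NumPy illustration: for brevity it contrasts each context vector only against the $N$ augmented vectors, rather than against all $2(N-1)$ negatives from both graphs as described above.

```python
import numpy as np

def edge_drop(adj, drop_ratio=0.3, seed=0):
    """Edge-drop augmentation: randomly remove a fraction of existing edges."""
    rng = np.random.default_rng(seed)
    keep = rng.random(adj.shape) >= drop_ratio      # keep each edge w.p. 1-ratio
    return adj * keep

def info_nce(C, C_aug, tau=0.2):
    """InfoNCE over a batch of context-vector pairs (simplified sketch).
    C, C_aug: (N, d); row n of each forms a positive pair, the rest negatives."""
    sim = (C @ C_aug.T) / tau                       # dot-product similarities
    sim = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))              # positives on the diagonal
```

When the two views agree perfectly and negatives are dissimilar, the loss approaches zero; noisy views push it up, which is what drives the representations toward augmentation-invariant features.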

Session representation and prediction layer
For each item $v_j$, we have two representations: one obtained from the OSG module and the other from the GCSG module, as described above. The final representation of the item is computed by sum pooling as follows, where $\mu$ is a hyper-parameter that controls the ratio of the representation learned from the OSG module. Next, we calculate the representation of each session $s_i$ in the same way as we initialize the context vectors, by Eq. (3). We then obtain the final recommendation probability of each item as below.
We use the cross-entropy of the prediction results $\hat{y} = \{\hat{y}_1, \ldots, \hat{y}_{|V|}\}$ and the ground-truth labels $y$ as the main loss. We then combine the SSL loss with the recommendation loss to jointly optimize the recommendation and self-supervised tasks, where $\lambda$ is a hyper-parameter that controls the weight of the contrastive SSL loss; the SSL loss is used as a regularization term to improve the effectiveness and robustness of the whole model.
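A minimal sketch of the prediction layer and the joint objective $L = L_{rec} + \lambda L_{ssl}$ (illustrative NumPy code; the function names are ours, and the scoring is plain inner product plus softmax as described):

```python
import numpy as np

def recommend(s_h, item_emb):
    """Prediction-layer sketch: score every candidate item by inner product
    with the session representation, then softmax into probabilities.
    s_h: (d,) session representation; item_emb: (|V|, d) item embeddings."""
    logits = item_emb @ s_h
    logits = logits - logits.max()                  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def joint_loss(y_hat, target, ssl_loss, lam=0.1):
    """L = L_rec + lambda * L_ssl: cross-entropy on the clicked item plus the
    weighted self-supervised contrastive loss used as a regularizer."""
    return -np.log(y_hat[target] + 1e-12) + lam * ssl_loss
```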

EXPERIMENTS

Experimental settings
We conducted our experiments on three benchmark datasets: Diginetica, Tmall, and RetailRocket. Following previous work, we filtered out sessions of length 1 and items appearing fewer than five times over all datasets. We used the sessions of the latest period (e.g., the data of the last week) as the test data and the remaining historical data for training and validation. Furthermore, for a session $S = [v_1, v_2, \ldots, v_m]$, we generated a series of sequence-label pairs $([v_1], v_2), ([v_1, v_2], v_3), \ldots, ([v_1, \ldots, v_{m-1}], v_m)$, where each prefix is a generated sequence and the following item is its label. The statistics of the datasets are summarized in Table 1. We adopted two widely used ranking-based metrics, P@K and MRR@K, to evaluate recommendation performance. A P@K score measures whether the target item is included in the top-K list of recommended items, and an MRR@K score considers the position of the target item in the list; higher values indicate better ranking accuracy. Moreover, we compare our model with the following session recommendation models to justify its effectiveness. FPMC (Rendle, Freudenthaler & Schmidt-Thieme, 2010) combines matrix factorization and Markov chains for recommendation. GRU4Rec (Hidasi et al., 2016) uses Gated Recurrent Units (GRUs) to model user sequences for session recommendation. NARM (Li et al., 2017) employs RNNs with an attention mechanism to capture the user's main purpose. STAMP (Liu et al., 2018) utilizes attention layers to capture the general preference and the current interest of the last click of the current session. SR-GNN (Wu et al., 2019) utilizes gated graph neural networks to update item embeddings and uses an attention mechanism to compute session representations. GCE-GNN constructs two types of session graphs to capture local and global information. SGNN-HN (Pan et al., 2020) applies a star graph neural network to model transition relationships between items.
S2-DHCN (Xia et al., 2021b) constructs a hypergraph and a line graph to learn inter- and intra-session information and uses self-supervised learning to provide complementary information.
COTREC (Xia et al., 2021a) constructs two views to capture inter- and intra-session information and uses a co-training strategy to iteratively select and evolve pseudo-labels as informative self-supervision examples.
The hyperparameters were selected on a validation set randomly sampled from the training set with a proportion of 10%. For the general setting, the embedding size is 256, the batch size is 1,024, and each session is truncated to a maximum length of 20. We adopt the Adam optimizer with an initial learning rate of $10^{-3}$ and a decay factor of 0.1 applied every three epochs. Moreover, the $L_2$ regularization is $10^{-5}$, the scaling temperature $\tau$ is 0.2, the edge-drop ratio is 0.3, and the ratio for all dropout layers is 0.1.
For the baseline models, we report the results from their original papers directly when available, since we use the same datasets and evaluation metrics, and use well-reproduced results from the literature for models without public code or that used different datasets. The results of FPMC, GRU4Rec, STAMP, and SR-GNN can be found in Xia et al. (2021a, 2021b), which are also baselines in our study. In addition, since public session recommendation datasets are usually split by time, the distribution of samples at the later positions of the training data is more similar to the test data than that of samples at earlier positions (Guo et al., 2022). Therefore, recommendation methods that construct graphs for individual sessions, such as SGNN-HN, fit better without shuffling. However, for methods that build graphs over multiple sessions, not shuffling can lead to label leakage during testing. For fairness, we reran the source code of the SGNN-HN model with shuffled training data. Since we could not find results of GCE-GNN on the RetailRocket dataset, we reran GCE-GNN and SGNN-HN, tuned their hyperparameters by grid search, and report the average results over 10 random seeds.

Results and analysis
The overall experimental results are presented in Table 2. Our model (SGC-GNN) consistently achieves good performance (statistically significant) on all three datasets under both evaluation metrics, verifying its superiority. From the results, we can draw the following conclusions.
The methods that take temporal information into account (i.e., GRU4Rec, NARM, STAMP, SR-GNN) achieve better results than the traditional method FPMC, demonstrating the importance of sequential effects for SBR. Moreover, all methods using deep learning techniques perform well, which indicates the powerful ability of deep learning models in SBR. Graph-based methods all achieve better results than RNN-based methods, demonstrating the ability of GNNs to model session data. Besides, the methods that capture different levels (inter- and intra-session) of information (i.e., GCE-GNN, S2-DHCN, COTREC) achieve better results than SR-GNN, which only considers intra-session information, demonstrating the usefulness of multi-level information for predicting user intention in SBR.

Our proposed model SGC-GNN outperforms all the baselines on all datasets. In particular, on both Tmall and RetailRocket, our model achieves significant improvements over the other methods, showing its effectiveness. The improvement of SGC-GNN over the baselines comes mainly from three aspects. The first is the proposed global context session graph (GCSG): by introducing a global context vector as a representative node for each session on the cross-session graph, the GCSG helps learn the relationship between every two items in a session and the high-order relationships between non-adjacent items in different sessions, so each node can gather more information and learn richer representations. The second is the unified model that combines session-level and global-level information to improve recommendation for the current session. The last is the self-supervised contrastive learning that improves the robustness of the model, whereas other cross-session approaches suffer from reduced robustness due to the large amount of noisy information introduced when constructing cross-session graphs.

Notes:
Best performing method is shown in bold. The second best performing method is shown with an underline. * Indicates the statistical significance for p < 0.01 compared to the best baseline method with paired t-test.

Ablation study
To investigate the contributions of each component in SGC-GNN, we developed three variants: GNN-NC, SGC-GNN-NL, and GC-GNN. In GNN-NC, we removed the context vectors and the self-supervised learning (SSL) module. In SGC-GNN-NL, we removed the session-level graph OSG. GC-GNN is the version without the SSL module. Table 3 compares these variants with the full SGC-GNN on the Tmall and RetailRocket datasets. We observe that removing the global context vectors causes a significant decrease in both metrics, showing that the global context vectors are very helpful for performance. The SSL module also effectively improves the model's performance: without it, both metrics decrease to different degrees on both datasets.

Impact of initialization of context vectors
To investigate the effectiveness of our initialization method for context vectors, we compared it with average pooling initialization. The results in Fig. 4 show that our initialization works significantly better than average pooling, proving its effectiveness. Our method assigns a different weight to each item in the session instead of simply averaging them, so the learned representation is more effective as a global representation of the session.

Impact of self-supervised learning
We introduced a hyper-parameter λ to control the magnitude of self-supervised learning.
To investigate its influence, we reported the performance with a set of representative λ values in {0, 0.01, 0.1, 0.3, 0.5, 0.7, 1} on Tmall and Diginetica. According to the results presented in Fig. 5, the recommendation task achieves good gains when jointly optimized with the SSL task. The proposed self-supervised contrastive learning module performs data augmentation on the cross-session graph and then imposes InfoNCE loss on the generated global context vectors, enabling the model to learn more essential features and make it more robust.

Impact of OSG module
To investigate the impact of the ratio of the OSG module, we report the performance with a set of representative μ values in {0, 0.1, 0.3, 0.5, 0.7, 1} on RetailRocket and Diginetica. From the results in Fig. 6, we can see the effectiveness of the OSG module, while the model achieves better performance when the ratio μ takes a small value. With a unified model, we can aggregate the item embeddings learned at the global and session levels to improve the current session's recommendation performance.

Efficiency
We evaluated the training efficiency of SGC-GNN and its variant GC-GNN. Since the COTREC and DHCN models require an older environment to run, we compared efficiency with the SR-GNN, SGNN-HN, and GCE-GNN methods. For a fair comparison, we set the batch size to 100 and the hidden size to 100 for all methods instead of 1,024 and 256, respectively, because a large batch size causes GCE-GNN to run out of memory. All experiments were conducted on a single Nvidia RTX A4000 GPU in the same computation environment. All methods were trained for 10 epochs, and we report the average training time per epoch in Table 4. From Table 4, we observe that SGC-GNN is slower than the other methods on the Tmall dataset, but on Diginetica our model takes about the same time as the others. Both the SGC-GNN and GCE-GNN models build larger session graphs containing information from multiple sessions, but GCE-GNN has a more complex structure, making it run out of memory on RetailRocket on our RTX A4000 GPU. Moreover, the variant GC-GNN, which removes the self-supervised contrastive learning, has time consumption similar to that of SR-GNN and SGNN-HN. Overall, the difference in training time of SGC-GNN is acceptable considering the performance improvement.

CONCLUSION
Existing graph-based recommendation methods have difficulty modeling the relationships between non-adjacent items and introduce noisy information when constructing the global graph, which reduces the robustness of the model. In this study, we proposed a self-supervised global context graph neural network model, SGC-GNN, to solve this problem. In the model, we use global context vectors as a bridge for passing information between non-adjacent items in different sessions, allowing the model to learn richer node representations. At the same time, to address the large amount of noisy information introduced by constructing cross-session graphs, we designed a self-supervised contrastive learning module that effectively improves the robustness of the model by augmenting the data and imposing an InfoNCE loss on the global context vectors as an auxiliary loss. Finally, we combine session-level information with global-level information through a unified model to enhance the feature representations of items. Experimental results and analysis demonstrate the superiority of the proposed model.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work is supported in part by the National Natural Science Foundation of China (No. 61876016) and the National Key R&D Program of China (2018AAA0100302). The funders had no role in the study design, data collection, analysis, publication decision, or manuscript preparation.