Session-aware Recommendation: A Surprising Quest for the State-of-the-art

Recommender systems are designed to help users in situations of information overload. In recent years, we observed increased interest in session-based recommendation scenarios, where the problem is to make item suggestions to users based only on interactions observed in an ongoing session. However, in cases where interactions from previous user sessions are available, the recommendations can be personalized according to the users' long-term preferences, a process called session-aware recommendation. Today, research in this area is scattered and many existing works only compare session-aware with session-based models. This makes it challenging to understand what represents the state-of-the-art. To close this research gap, we benchmarked recent session-aware algorithms against each other and against a number of session-based recommendation algorithms and trivial extensions thereof. Our comparison, to some surprise, revealed that (i) simple techniques based on nearest neighbors consistently outperform recent neural techniques and that (ii) session-aware models were mostly not better than approaches that do not use long-term preference information. Our work therefore not only points to potential methodological issues where new methods are compared to weak baselines, but also indicates that there remains a huge potential for more sophisticated session-aware recommendation algorithms.


Introduction
Recommender systems (RS) can nowadays be found on many modern e-commerce and media streaming sites, where they help users find items of interest in situations of information overload. One reason for the success of RS lies in their ability to personalize the item suggestions based on the preferences and observed past behavior of individual users. Historically, researchers have therefore strongly focused on situations where only information about long-term user preferences is available, e.g., in the form of item ratings. Only in recent years has more focus been put on the problem of session-based recommendation, where the system has to deal with anonymous users and therefore can base its recommendations only on a small number of interactions that are observed in the ongoing session.
Due to the practical relevance of this problem, a variety of technical approaches to session-based recommendation were proposed in the past few years, in particular ones based on deep learning (neural) techniques, see [28,47]. Implicitly, these methods try to make recommendations by guessing the user's short-term intent or situational context only from the currently observed interactions. However, while it is well known that the current intents and context may strongly determine which items are relevant in the given situation [18], information about long-term preferences of users, if available, should not be ignored. In particular, the consideration of such information allows us to make session-based recommendations that are personalized according to long-term preferences, a process which is also called session-aware recommendation [35].
Session-aware recommendation problems are recently receiving increased interest. Today, the research literature is however still scattered, which makes it difficult to understand what represents the state-of-the-art in this area.
One particular problem in that context is that existing works do not use a consistent set of baseline algorithms in their performance comparisons. Some works, for example, mainly compare session-aware models with session-based ones, i.e., with algorithms that do not consider long-term preference information, e.g., gru4rec [15]. Several other works use sequential recommendation algorithms, e.g., fossil [13], as a baseline. These are algorithms that consider the sequence of the events but are usually designed for settings where the input is a time-stamped user-item rating matrix and not a sequential log of observed interactions. Only in a few works, previous session-aware algorithms are considered in the evaluations. One example is the method by Phuong et al. [34], which uses the hgru4rec method [36] as a baseline.
Finally, almost all works include some trivial baselines, e.g., the recommendation of popular items in a session.
With this work, our goal is to close the research gap regarding the state-of-the-art in session-aware recommendation. For this purpose, we have conducted extensive experimental evaluations in which we compared five recent neural models for session-aware recommendation with (i) a set of existing neural and non-neural approaches to session-based recommendation, and (ii) heuristic extensions of the session-based techniques that, e.g., make use of reminders [20] or consider interactions that were observed immediately before the ongoing session.
Regarding the baseline techniques, we in particular considered methods based on nearest neighbors, which previously proved to be very competitive in session-based recommendation scenarios [29]. All investigated techniques were compared by extending the evaluation framework shared in [27]. For reproducibility purposes, we share all data and code used in the experiments online 1 , including the code for data preprocessing, hyperparameter optimization, and performance measurement.
The results of our investigations are more than surprising. In the majority of cases, and on all four considered datasets, heuristic extensions of existing session-based algorithms were the best-performing techniques. In many cases, even plain session-based techniques, and in particular ones based on nearest-neighbor techniques, outperform recent session-aware models even though they do not consider the available long-term preference information for personalization. With our work, we therefore provide new baselines that should be considered in future works on session-aware recommendation. On a more general level, our observations also point to potential methodological issues, where new models are compared to baselines that are either not properly optimized, that do not leverage all available information, or that are rather weak for the given task. Similar observations were previously made in the field of information retrieval and in other areas of recommender systems [3,7,49].
On a more positive note, our evaluations suggest that there is a huge potential to be tapped by more sophisticated (neural) algorithms that combine short-term and long-term preference signals for session-aware recommendation. An important prerequisite for progress in this area however lies in an increased level of reproducibility of published research. A side observation of our research is that despite some positive developments in recent years, where researchers increasingly share their code on public repositories, it in many cases still remains challenging to reproduce existing works.
The paper is organized as follows. In Section 2, we discuss relevant previous works. Section 3 describes our research methodology in more detail with respect to the compared algorithms, the evaluation protocol, and the performance measures. The results of our experiments are reported in Section 4, and we discuss research implications and future works in Section 5.

Previous Work
Historically, recommender systems research focused strongly on the problem of rating prediction given a user-item rating matrix, a setting which is also known as the "matrix completion" problem [39]. In this original collaborative filtering problem setting, the order of the ratings or the time when they were provided were not considered in the algorithms. Soon, however, it turned out that these aspects can matter, leading to the development of time-aware recommender systems [5], e.g., in the form of the timeSVD++ algorithm as used in the Netflix Prize [24].
Ten years after the Netflix Prize, the focus of research has mostly shifted from rating prediction to settings where mainly implicit feedback signals by users (e.g., purchase or item-view events) are available. Moreover, instead of considering the user-item matrix as the main input, recent research more often focuses on settings where the main input to a recommendation algorithm are sequential logs of recorded user interactions. The family of approaches that relies on such types of inputs is referred to as sequence-aware recommender systems [35].
Within this class of sequence-aware recommender systems, we can differentiate between three main categories of problem settings and algorithmic approaches.
• Sequential Recommender Systems: Unlike the other types of sequence-aware approaches discussed here, these systems are rooted in the tradition of relying on a user-item rating matrix as an input. The particularity of such systems is that the sequence of events is first extracted from the time-stamped rating matrix, and the goal then usually is to predict the immediate next user action (e.g., the next Point-of-Interest that a user will visit), given the entire preference profile of the user.
• Session-based Recommender Systems: The input to these systems are time-ordered logs of recorded user interactions, where the interactions are grouped into anonymous sessions. Such a session could, for example, correspond to a listening session on a music service, or a shopping session on an e-commerce site. One particularity of such approaches is that users are not tracked across sessions, which is a common problem on websites that deal with first-time users or users that are not logged in. The prediction task in this setting is to predict the next user action, given only the interactions of the current session. Today, session-based recommendation is a highly active research area due to its practical relevance.
• Session-aware Recommender Systems: This category is also referred to as personalized session-based recommender systems. It is similar to session-based recommendation in that the user actions are grouped into sessions. Also the prediction goal is identical. However, in this problem setting, users are not anonymous, i.e., it is possible to leverage information about past user sessions when predicting the next interaction for the current session.

Figure 1 illustrates the differences between session-based and session-aware recommendation scenarios. 2 In both problem settings, the recommendation problem consists of predicting which action a user will do next in an ongoing session. In an e-commerce setting, for example, the problem is to predict which items are relevant for the user, given the last few observed interactions in an ongoing shopping session. The difference between the two scenarios however is that in one case we have long-term preference information about the user (session-aware recommendation), whereas in the other case such information is not available. Technically, this usually means that the sessions in the interaction log are associated with user identifiers, so that the past sessions of a user can be linked to the ongoing one.

[Figure 1: Session-based vs. session-aware recommendation. Session-based recommendation — training data: past anonymous sessions of the user community.]

2 Remember that sequential problems, which are not based on the concept of a session, are not in the scope of this work.

Unfortunately, the terminology in the literature is not entirely consistent. In this work, we will therefore use the categorization and terminology as described above to avoid confusion. Next, we review the main technical approaches in each category.
Sequential Recommendation Approaches. The first comparably simple approaches in this category were based on Markov models, e.g., [32]. Later on, more sophisticated approaches emerged which, for example, combined the advantages of matrix factorization techniques with sequence modeling approaches. Rendle et al. [38] proposed the Factorized Personalized Markov Chain (fpmc) approach as an early method for next-basket recommendation in e-commerce settings, where user interactions are represented as a three-dimensional tensor (user, current item, next item). Kabbur et al. [22] later proposed fism, a method based on an item-item factorization. fism was then combined with factorized Markov chains to incorporate sequential information into the fossil model [12].
In recent years, several sequential recommender systems based on neural networks were developed. Kang and McAuley [23], for example, proposed sasrec (self-attention based sequential model), a method that, like an RNN, can capture long-term semantics. However, through the use of an attention mechanism, it focuses only on a smaller set of interactions to make the item predictions. In the caser method, Tang and Wang [45] embedded a sequence of recent items into latent spaces as an "image" in time, and proposed to learn sequential patterns as local features of the image with the help of convolutional filters. Most recently, Sun et al. [43] proposed bert4rec, which employs a deep bidirectional self-attention mechanism to model user behavior sequences.
In this work, we do not consider this class of algorithms in our performance comparison because these methods, in their original designs, do not consider the concept of a session in the input data. While it is in principle possible to apply these methods in a particular way for session-based recommendation problems, a previous evaluation shows that sequential approaches are often not competitive with techniques that were specifically designed for the problem setting. Specifically, the evaluation presented in [27] included a number of sequential approaches, namely fpmc, mc, smf, bpr-mf, fism, and fossil. 3 Their findings showed that (i) these approaches either are generally not competitive in this setting or only lead to competitive results in a few specific cases and (ii) that nearest neighbor recommenders outperform them in terms of prediction accuracy.
Session-based Recommendation Approaches. While there exist some earlier works on session-based recommendation, e.g., in the context of website navigation support and e-commerce [31,42], research on this topic started to grow considerably only in the mid-2010s. These developments were particularly spurred by the release of datasets in the context of machine learning competitions, e.g., at ACM RecSys 2015. At around the same time, deep learning methods were increasingly applied for recommendation problems in general. The first deep learning approach to session-based recommendation was gru4rec [15], which is based on Recurrent Neural Networks. Later on, various other types of neural architectures were explored, including attention mechanisms, convolutional neural networks, graph neural networks, or hybrid architectures, see [47] for an overview.
Recent work however indicates that in many cases much simpler methods can achieve similar or even higher performance levels than today's deep learning models. Most recently, Ludewig et al. [29] benchmarked several of the mentioned neural methods against conceptually simpler session-based algorithms which, for example, rely on nearest-neighbor techniques. Quite interestingly, their analyses and similar previous works [8,27] not only show the strong performance of conceptually simple techniques, but also revealed that two of the earlier neural methods, gru4rec and narm, often perform better than more recent complex techniques. In the performance comparison in this present work on session-aware recommendation, we include several techniques for session-based recommendation as baselines. This allows us to assess the added value of considering long-term preference information compared to a situation where such information is not available or not leveraged.

3 fpmc: Factorized Personalized Markov Chains [38], mc: Markov Chains [32], smf: Session-based Matrix Factorization [27], bpr-mf: Bayesian Personalized Ranking [37], fism: Factored Item Similarity Models [22], fossil: FactOrized Sequential Prediction with Item SImilarity ModeLs [12].
Session-aware Recommendation Approaches. The literature on session-aware recommendation is still quite sparse. An early approach is discussed in [20].
One main goal of their work was to understand the relative importance of short-term user intents when visiting an e-commerce site compared to the long-term preference model. Their analyses, which were based on a large but private e-commerce dataset, emphasized the importance of considering the most recent observed user behavior when recommending. Furthermore, it also turned out that reminding users of items that they have viewed before can be beneficial, both in terms of accuracy measures and business metrics.
While the work in [20] relied on deep learning for the final predictions in one of their models, the core of the proposed technical approach was based on feature engineering and the use of side information about the items to recommend. One of the earliest "pure" deep learning techniques for session-aware recommendation was proposed by [36]. Technically, the authors based their work on gru4rec, and they used a second, parallel GRU layer to model information across sessions, resulting in a model called hgru4rec. Their analyses showed that incorporating long-term preferences can be beneficial, i.e., hgru4rec was outperforming an early version of gru4rec in their experiments.
In the same year, Ruocco et al. [40] proposed the iirnn model. Like hgru4rec, this model uses an RNN architecture and extends a session-based technique to model inter-session and intra-session information. Like in the case of hgru4rec, the authors investigate the value of considering long-term preference information by comparing their method to session-based techniques. RNNs were later on also used in the nsar model [33] to encode session patterns in combination with user embeddings to represent long-term user preferences across sessions. In their experiments, the authors not only compare their model to session-based techniques, but also to hgru4rec as a representative of a session-aware approach.
A number of neural architectures other than RNNs were proposed in recent years. Hu et al. [17], for example, combine the inter-session and intra-session context with a joint context encoder for item prediction in the ncsf approach. In the shan model [50], in contrast, the authors leverage a two-layer hierarchical attention network to model short-term and long-term user interests. In the swiwo [16] approach, the authors were inspired by language modeling approaches like word2vec, where the underlying idea is that items can be seen as words, hence, predicting a relevant word based on context information is equivalent to recommending a relevant item in an ongoing session. Finally, Cai and Hu [4] proposed the samr method, which leverages a topic-based probabilistic model to define the users' listening behavior.
Unlike most previous works, which compare a newly-proposed session-aware model with previous session-based ones, our work compares session-aware methods against each other. Furthermore, we benchmark several of these recent session-aware methods with (i) existing session-based techniques and (ii) extended versions of them that also consider long-term preference information. We describe our research methodology next.

Research Methodology
In this section, we describe which algorithms we selected for inclusion in our comparison. Moreover, we provide details about the experimental configuration in terms of the evaluation protocol and the used datasets. As mentioned, all datasets and the code used in the experiments are shared online to ensure reproducibility.

Compared Algorithms
In our experiments, we compare neural session-aware algorithms with a number of baselines. Details about the algorithms are provided next.

Neural Session-Aware Algorithms
We identified five recent neural approaches, which we integrated into our evaluation framework using the source code provided by the authors: hgru4rec, iirnn, shan, ncsf, and nsar.

• hgru4rec: This method [36] is based on the gru4rec algorithm. It adds a second, parallel GRU layer to propagate information across the sessions of a user.

• iirnn: This RNN-based model [40] extends a session-based technique to model both inter-session and intra-session information.

• shan: This model [50] uses a two-layer hierarchical attention network to learn a hybrid representation for each user that combines the long-term and short-term preferences. It first embeds sparse user and item inputs into low-dimensional dense vectors. The long-term user representation is a weighted sum over the embeddings of items in the long-term item set. By learning the weights, the first attention layer learns user long-term preferences. The second attention layer returns the final user representation by combining the long-term user model and the embeddings of the items in the short-term item set.
• ncsf: This session-aware neural method [17] has three components: (i) the historical session encoder to represent the inter-session context, (ii) the current session encoder to represent the intra-session context, and (iii) the joint context encoder to integrate the information of the intra-session context and the inter-session context for item prediction.
• nsar: This method [34] utilizes RNNs to encode session patterns (short-term user preferences) and user embeddings to represent longterm user preferences across sessions. It supports different ways of integrating user long-term preferences with the session patterns, where user embeddings can either be integrated with the input or the output of the session RNNs. Moreover, with the help of a gating mechanism, the contribution of each component can be fixed or adaptive.
To avoid a bias in the algorithm selection, we applied the following procedure to identify algorithms for inclusion in our experiments. An initial set of candidate works was retrieved through a search on Google Scholar using search terms like "session-aware recommendation" or "personalized session-based recommendation". We inspected the returned results to see if the papers fulfilled our inclusion criteria. Besides being actually a work on session-aware recommendation according to the above definition, we required that the source code of the method was publicly available and could be integrated into our Python-based evaluation framework 4 . Moreover, we only considered papers that had undergone a peer review process, i.e., we did not include non-reviewed preprints. Finally, we included only works that did not consider side information about the items. As a result of this last constraint, we did not include works like [44] for the domain of news recommendation.
In Table 3.1.1, we show to which baselines the selected neural approaches were compared in their original publications. Note that in this table only the last two rows, hgru4rec and swiwo, represent session-aware techniques.
Our analysis furthermore shows that researchers use a variety of baselines in their experiments, which contributes to the difficulty of understanding what represents the state-of-the-art.

Neural and Non-Neural Session-based Baselines
We selected the baselines according to the results of [29]. Note that while gru4rec and narm are not the most recent neural methods, the analysis in [29] showed that they are highly competitive among the neural models for different datasets.
• gru4rec: This neural model [15] employs RNNs based on Gated Recurrent Units (GRU) for the session-based recommendation task. It introduces several modifications, including a ranking loss function, to classic RNNs to adapt them for the recommendation task in the session-based setting. The authors later on improved the model with an alternative loss function and by applying further refinements [14]. We include the latest version of gru4rec in our experiments.
• narm: This model [26] extends gru4rec. It utilizes a hybrid encoder with an attention mechanism to model the sequential behavior of users and capture their main intent in the current session. A bilinear matching scheme is used to compute the recommendation scores of each candidate item based on a unified session representation.
• sr: This simple baseline (Sequential Rules) [27] scores items based on sequential item co-occurrences in the training sessions. It considers the order of the items in a session as well as the distance between them when scoring the items.
• vsknn: This nearest-neighbor baseline for session-based recommendation was proposed in [27], and it is based on the sknn method [19]. It first finds past sessions that contain the same items as the current session. The recommendation list is then generated using items that occur in the most similar sessions. This method considers the order of the items while computing both session similarities and item scores. Moreover, it applies the Inverse-Document-Frequency (IDF) weighting scheme to put more emphasis on less popular items.
• stan: The Sequence and Time Aware Neighborhood method was proposed in [8]. It improves sknn by considering three additional factors for making recommendations: (i) the recency of an item in the current session, (ii) the recency of a past session w.r.t. the current session, and (iii) the distance of a recommendable item w.r.t. a shared item in the neighboring sessions.
• vstan: This nearest-neighbor session-based recommendation algorithm, proposed in [29], combines all the extensions to sknn from stan and vsknn in a single approach.
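To make the shared intuition behind these nearest-neighbor methods concrete, the following minimal sketch scores items in the spirit of sknn: it finds the k past sessions most similar to the current one (here, cosine similarity over binary item-set vectors) and accumulates the similarity onto each item they contain. The function name, the similarity measure, and the value of k are illustrative simplifications, not the tuned vsknn/stan/vstan variants evaluated in this paper.

```python
from collections import defaultdict

def sknn_scores(current_session, past_sessions, k=2):
    """Score items in an sknn-style manner (simplified sketch)."""
    cur = set(current_session)
    sims = []
    for sid, sess in enumerate(past_sessions):
        s = set(sess)
        overlap = len(cur & s)
        if overlap:
            # cosine similarity between binary item-set vectors
            sims.append((overlap / (len(cur) * len(s)) ** 0.5, sid))
    sims.sort(reverse=True)
    scores = defaultdict(float)
    for sim, sid in sims[:k]:            # k nearest past sessions
        for item in set(past_sessions[sid]):
            if item not in cur:          # recommend unseen items only
                scores[item] += sim
    return dict(scores)
```

For example, with current session [1, 2] and past sessions [[1, 2, 3], [2, 4], [5, 6]], item 3 receives a higher score than item 4 because it occurs in the more similar neighbor.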

Extensions of Session-based Baselines
We experimented with three simple ways of extending session-based algorithms so that they consider past preference information.
• Extend -Extending the current session with recent interactions: In case there is little information available about the ongoing session, e.g., when only a few first clicks are recorded, we extend the current session with interactions that we observed in the previous sessions of the user.
• Boost -Emphasizing previously seen items: In some domains, repeated interactions with already seen or consumed items are not uncommon. We apply a simple "boosting" approach to slightly increase the scores computed by an underlying algorithm in case an item has appeared previously in the interaction history.
• Remind -Applying reminding techniques: In [20], the authors proposed a number of "reminder" techniques to emphasize items the user has seen before. The general approach is to determine and score a set of candidate items that the current user has interacted with in the past. We considered different strategies that were inspired by [20] in our evaluations.
Extend. We implemented this strategy as follows. First, we choose a value for the "desired session length" d, which is a hyperparameter to be determined on the validation set. In case an ongoing session has fewer interactions than d, we extend the current session with previous interactions from the same user until the session length d is reached or no more previous interactions exist. The extension is done by simply prepending the elements of previous interactions to the current session in the order they appeared in the log.
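The Extend strategy described above can be sketched as follows; `extend_session` is a hypothetical name for illustration, assuming the user's past interactions are available as a chronologically ordered list:

```python
def extend_session(current, history, d):
    """Prepend the user's most recent past interactions to the current
    session until the desired session length d (a hyperparameter tuned
    on the validation set) is reached, or history runs out."""
    if len(current) >= d:
        return list(current)
    missing = d - len(current)
    # take the last `missing` past interactions, keeping their log order
    return list(history[-missing:]) + list(current)
```

For a current session [7], a history [1, 2, 3, 4], and d = 3, the extended session becomes [3, 4, 7].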
Boost. This simple approach can in principle be applied to any algorithm that returns scores. Methods like sr and vsknn, for example, return scores based on item co-occurrences and the positions of the co-occurring items, as described in [27]. In our experiments, we used a hyperparameter b as a boost factor. Technically, we look up each item that is recommended by the underlying method and check if it occurred in the interaction history of the current user at least once. In case the item appeared previously in the history, we increase the original score by b %.
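A minimal sketch of this boosting step, assuming the underlying recommender already returned a dictionary of item scores (`boost_scores` is an illustrative name, not code from our framework):

```python
def boost_scores(scores, user_history, b=10.0):
    """Increase the score of each recommended item by b percent if the
    user interacted with it at least once before; b is a hyperparameter
    tuned on the validation set."""
    seen = set(user_history)
    return {item: s * (1 + b / 100.0) if item in seen else s
            for item, s in scores.items()}
```

With b = 50, an item previously seen by the user with score 1.0 is boosted to 1.5, while unseen items keep their original scores.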
Remind. Different reminding strategies were proposed in [20]. In our experiments, we tested a number of alternative ways to select and score items to consider as reminders. In all of them, the reminder list is created by adding items of the user's last p sessions, which is a hyperparameter to be determined on the validation set. The assumption is that very old sessions at some stage might become irrelevant and should not be considered anymore.
The following reminder strategies are the best performing ones according to our experiments.
Interaction Recency. In this strategy, we use a decay function to score reminder items, i.e., items in the interaction history of the user, based on the time of the user's last interaction with them:

IRecScore(i) = 1 / (T_c − T_i)

Here, T_c is the timestamp of the current session, and T_i is the timestamp of the last interaction of the user with the item i.
Session Similarity. In this strategy, the items of the last P sessions of the user are scored based on the similarity score(s) of the current session and the previous session(s) that they belong to. This strategy is based on a nearest-neighbor recommendation method to calculate the similarity scores between sessions:

SSimScore(i) = Σ_{S_p} Sim(S_c, S_p) · 1_p(i)

Here, S_c denotes the current session, and S_p denotes a session in the set of the last P past sessions of the user. Sim is the similarity function of the nearest-neighbor algorithm, and the indicator function 1_p(i) returns 1 if the session p contains item i and 0 otherwise.
In our present work, we used a hybrid approach to combine aspects of interaction recency (IRec), session similarity (SSim), and the item's relevance score (RelScore), as determined by a recommendation algorithm, in a weighted approach. The overall ranking score for an item i is therefore defined as

Score(i) = W1 · RelScore(i) + W2 · IRecScore(i) + W3 · SSimScore(i)

Here, W1, W2, and W3 are hyperparameters to tune, and the individual scores are normalized before they are combined. Note that for algorithms that are not based on nearest neighbors, we do not compute a SSimScore value, and we thus set W3 to 0.
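The weighted combination of the three score components can be sketched as follows. All names are illustrative, and the reciprocal time decay used for the recency component is an assumption of this sketch rather than the exact decay function used in our experiments:

```python
def hybrid_reminder_scores(rel, t_last, sess_sims, past_sessions, t_now,
                           w1=0.5, w2=0.3, w3=0.2):
    """Combine relevance, interaction recency, and session similarity.
    rel: item -> relevance score from the base recommender
    t_last: item -> timestamp of the user's last interaction with it
    sess_sims: similarities of the current session to each past session
    past_sessions: item sets of the user's last P sessions
    Each component is min-max-free normalized by its maximum before the
    weighted sum (a simplification of the normalization step)."""
    irec = {i: 1.0 / (t_now - t) for i, t in t_last.items() if t_now > t}
    ssim = {}
    for sim, items in zip(sess_sims, past_sessions):
        for i in items:
            ssim[i] = ssim.get(i, 0.0) + sim

    def norm(d):
        m = max(d.values(), default=0.0)
        return {k: v / m for k, v in d.items()} if m else {}

    rel, irec, ssim = norm(rel), norm(irec), norm(ssim)
    items = set(rel) | set(irec) | set(ssim)
    return {i: w1 * rel.get(i, 0.0) + w2 * irec.get(i, 0.0)
            + w3 * ssim.get(i, 0.0) for i in items}
```

Setting w3 = 0 recovers the behavior for base algorithms that are not nearest-neighbor based, as described above.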

Datasets
We conducted our evaluations on four public datasets from three different domains: e-commerce, social networks, and music.

• XING : A dataset from the social network domain, containing interactions of users with job postings. We removed interactions of type delete from the dataset for our evaluation, as done in [36], because they are considered as negative interactions.
• COSMETICS : An e-commerce dataset containing the event history of a cosmetics shop for five months. 5 The interactions include four types of actions: view, add-to-cart, remove-from-cart, and purchase. Since the methods examined in our work are not designed to consider multiple types of actions, we only used the interactions of one type (item views) for our evaluation, as was done also in various previous works [15,26]. Furthermore, we randomly sampled 10% of the users of this large dataset because of scalability issues of some of the neural methods.
• LASTFM : A music dataset that contains the entire listening history of almost 1,000 users over a period of five years. The dataset was retrieved from the online music service Last.fm 6 .
For each dataset, we first partitioned the log into sessions by applying a commonly used 30-minute user inactivity threshold 7 . We kept multiple interactions with the same item in one session because repeated recommendations can help as reminders [20,25]. Many previous works on session-aware recommendation use a single training-test split of the whole dataset or a sample of it for evaluation. Evaluating on only one split of data is risky because of possible random effects. We therefore split each dataset into five contiguous subsets by time and averaged the results across slices as done in [29]. To have about the same number of events for each slice, we skipped the first 500 days of the LASTFM dataset.
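The sessionization step based on the inactivity threshold can be sketched as follows, assuming each user's log is a chronologically ordered list of (timestamp, item) pairs with timestamps in seconds (illustrative names):

```python
IDLE = 30 * 60  # 30-minute user inactivity threshold, in seconds

def split_into_sessions(events, idle=IDLE):
    """Partition one user's chronologically ordered (timestamp, item)
    events into sessions: a gap larger than `idle` starts a new session.
    Repeated interactions with the same item are kept."""
    sessions, current, last_t = [], [], None
    for t, item in events:
        if last_t is not None and t - last_t > idle:
            sessions.append(current)
            current = []
        current.append(item)
        last_t = t
    if current:
        sessions.append(current)
    return sessions
```

For example, two events 31 minutes apart end up in different sessions, while events within the threshold stay together.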
Following common practice in the field, we then further pre-processed each slice as follows. We first removed items with fewer than five interactions in the slice. Then, we removed sessions that contain only one event. For the LASTFM dataset, we also removed sessions with more than 20 events 8 . Finally, we filtered out users with fewer than three sessions. Table 2 shows the average characteristics of the slices for each dataset after the preprocessing phase.
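A sketch of these filtering steps, applied once in the order described (illustrative names; a single pass over the data rather than an iterative re-counting of item frequencies):

```python
from collections import Counter

def preprocess(sessions_by_user, min_item=5, min_sess_len=2,
               max_sess_len=None, min_user_sessions=3):
    """Drop rare items, then too-short sessions, optionally cap session
    length (used for LASTFM), then drop users with too few sessions."""
    counts = Counter(i for ss in sessions_by_user.values()
                     for s in ss for i in s)
    out = {}
    for user, ss in sessions_by_user.items():
        kept = []
        for s in ss:
            s = [i for i in s if counts[i] >= min_item]  # rare-item filter
            if max_sess_len:
                s = s[:max_sess_len]                     # length cap
            if len(s) >= min_sess_len:                   # single-event filter
                kept.append(s)
        if len(kept) >= min_user_sessions:               # user filter
            out[user] = kept
    return out
```

The default parameter values correspond to the thresholds described above (five item interactions, two events per session, three sessions per user).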

Evaluation Protocol and Metrics
Creation of Training, Validation, and Test Splits. Since we are given user-IDs for the sessions, we are able to apply a user-wise data splitting approach. Specifically, like in [11,33,34,36,48], we use the last session of each user as test data. The second-to-last session is used as validation data to tune the parameters in our experiments, see also [33,34,48]. The remaining sessions are considered as training data. This splitting approach allows us to assess the performance both for users with shorter and longer interaction histories.

7 The COSMETICS dataset already contains session IDs, but we noticed that there were large inactivity gaps and some of the original sessions spanned several days.

8 The dataset contains a number of very long sessions with dozens of listening events, and the probability that users actively listened to tracks for many hours seems low. Therefore, we only considered the first 20 elements, which corresponds to a listening session of about 1.5 hours for the case of pop music, see also [33,40] for similar approaches.
We finally filter out the items from the validation and test sets of each of the five slices that did not appear in the training set of that slice.
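The user-wise splitting and the final filtering step can be sketched as follows, assuming the sessions of each user are already in chronological order and each user has at least three sessions (guaranteed by the preprocessing). The function and variable names are ours.

```python
from collections import defaultdict


def user_wise_split(sessions):
    """Split (user_id, [items]) sessions into train/validation/test:
    the last session of each user becomes test data, the
    second-to-last validation data, and the rest training data.
    Items unseen in training are dropped from validation and test."""
    per_user = defaultdict(list)
    for user, items in sessions:
        per_user[user].append(items)

    train, valid, test = [], [], []
    for user, sess in per_user.items():
        train += [(user, s) for s in sess[:-2]]
        valid.append((user, sess[-2]))
        test.append((user, sess[-1]))

    seen = {i for _, s in train for i in s}

    def prune(split):
        return [(u, [i for i in s if i in seen]) for u, s in split]

    return train, prune(valid), prune(test)
```

In a full implementation one would additionally discard validation or test sessions that become empty after pruning.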
Target Item Selection. We apply a procedure that is also commonly used in session-based recommendation, e.g., in [15] and many other works. Specifically, we iteratively reveal the items of each test session one after the other and evaluate the predictions after each revealed item. We apply this approach as it most closely reflects realistic user behavior within a session. Following previous works [27,28,29], we consider two evaluation scenarios; in one of them, we only consider the immediate next item in the test session as the relevant target item.
Note on ncsf: in the implementation of the ncsf method, we found that the authors use a context window, which includes items both before and after a given target item t, to make the prediction.
In reality, however, we cannot know which items will appear after t. As we could not resolve this issue with the authors, we used an implementation that only uses the items appearing before the target item as the context.
Coverage and Popularity Bias Metrics. It is well known that factors other than accuracy can impact the effectiveness of a recommendation algorithm in practice [41]. In this work, we consider coverage and the tendency of an algorithm to focus on popular items (popularity bias) as relevant additional factors.
Coverage tells us how many items ever actually appear in the top-n lists presented to users. This measure, which is also known as "aggregate diversity" [1], gives us some indication of how strongly personalized the recommendations of an algorithm are. A strong popularity bias, on the other hand, indicates that an algorithm does not focus much on the long tail of the item catalog, even though recommending long-tail items can be desirable in practice. We calculate the popularity bias as done in [27]. Specifically, we average the popularity scores of all recommended items, where the popularity score of an item is the number of times it appears in the training set. To bound the values between 0 and 1, we apply min-max normalization.
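Both measures can be sketched in a few lines. In this simplified illustration, `rec_lists` holds the top-n recommendation lists produced at each evaluation point and `train_counts` the raw item frequencies in the training split; both names are our own.

```python
def coverage(rec_lists, catalog_size, n=20):
    """Fraction of catalog items that ever appear in a top-n list
    ("aggregate diversity")."""
    shown = {i for recs in rec_lists for i in recs[:n]}
    return len(shown) / catalog_size


def popularity_bias(rec_lists, train_counts, n=20):
    """Average min-max-normalized training popularity of all
    recommended items, as described above."""
    lo, hi = min(train_counts.values()), max(train_counts.values())
    scores = [(train_counts.get(i, 0) - lo) / (hi - lo)
              for recs in rec_lists for i in recs[:n]]
    return sum(scores) / len(scores)
```

For instance, recommending three distinct items out of a four-item catalog yields a coverage of 0.75, independently of how often each item is recommended.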
Computational Complexity. Training deep learning models is often considered computationally demanding. Nearest-neighbor techniques, on the other hand, have no model-building phase, but the search for neighbors can, depending on the implementation, be computationally complex at run time.
In our experiments, we therefore use the recency-based sampling approach proposed in [19] for all session-based nearest-neighbor approaches. To compare the computational complexity of the various methods, we measured the time that is needed by each algorithm, both for the training and the prediction phases. All these measurements were made on the same physical machine, which was exclusively used for the time measurements. The machine is equipped with a Nvidia Titan Xp GPU and an Intel i7-3820 CPU.
The neural models used the GPU whereas the non-neural techniques only used the CPU.
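The recency-based neighbor sampling mentioned above can be sketched as part of a simplified session-kNN prediction step. This is our own simplification: the overlap-based similarity and all names are illustrative, and the actual methods from [19] use more elaborate weighting schemes.

```python
def knn_scores(current_session, past_sessions, m=500, k=100):
    """Sketch of a session-kNN prediction with recency-based sampling:
    only the m most recent past sessions are considered as neighbor
    candidates, candidates are scored by (cosine-style) item overlap
    with the current session, and the items of the k closest
    neighbors are aggregated into recommendation scores."""
    candidates = sorted(past_sessions, key=lambda s: s[0], reverse=True)[:m]
    cur = set(current_session)
    sims = []
    for ts, items in candidates:
        overlap = len(cur & set(items))
        if overlap:
            sim = overlap / (len(cur) ** 0.5 * len(set(items)) ** 0.5)
            sims.append((sim, items))
    sims.sort(key=lambda x: -x[0])

    scores = {}
    for sim, items in sims[:k]:
        for item in items:
            if item not in cur:  # do not re-recommend current items
                scores[item] = scores.get(item, 0.0) + sim
    return scores  # item -> score; higher means a stronger recommendation
```

Restricting the candidate set to the m most recent sessions is what keeps the prediction time of the nearest-neighbor methods manageable on large logs.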
Hyperparameter Optimization. To obtain reliable results, we systematically and automatically tuned the hyperparameters for each algorithm and dataset.
Technically, we applied a random hyperparameter optimization procedure with 100 iterations to optimize MRR@20, as done, e.g., in [27]. We also tried MAP@20 as the optimization target for some approaches, but this did not lead to a different ranking of the algorithms in terms of accuracy. For narm, we only ran 50 iterations, as this method has a smaller set of hyperparameters. For shan, we only evaluated 9 hyperparameter configurations, since these cover all possible value combinations reported in the original paper; we contacted the authors regarding the hyperparameter ranges, but without success. For each dataset, we used the slice with the largest number of events to tune the hyperparameters.
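The tuning procedure amounts to plain random search. The following sketch uses our own names; `train_fn` and `eval_fn` are placeholders for the per-algorithm training routine and the validation evaluation (e.g., MRR@20 on the validation split).

```python
import random


def random_search(train_fn, eval_fn, space, n_iter=100, seed=42):
    """Random hyperparameter optimization as described above: sample
    n_iter configurations from `space` (a dict mapping parameter
    names to lists of candidate values), train each, and keep the
    configuration with the best validation score."""
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_iter):
        config = {name: rng.choice(values) for name, values in space.items()}
        model = train_fn(config)
        score = eval_fn(model)  # e.g., MRR@20 on the validation slice
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```

With independent per-parameter sampling, random search covers mixed discrete/continuous spaces uniformly, which is why it is a common default for this kind of benchmark.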

Results
Tables 3-6 show the results of our performance comparison of neural and non-neural methods, ordered by the values obtained for the MAP@20 metric.
Here, we correspondingly report the values obtained by applying a cut-off threshold of 20; we performed additional experiments using alternative cut-off thresholds as well. Note that, for the sake of conciseness, we do not report all possible combinations of the proposed extensions discussed in Section 3.1.3. We use the following naming scheme for the different algorithm variants.
• Remind: We denote algorithm variants that were extended with the reminder technique with the postfix "_r". Note that it is not meaningful to incorporate the reminder extension into the session-aware methods, as these models should already be able to implicitly leverage the long-term preference information.
• Extend and Boost: We report the effects of these extensions for the non-neural methods, and we denote the extended methods by appending the postfix "_e" (extend) and "_b" (boost) to the algorithm name, e.g., sr_b, vsknn_eb. These extensions can also be combined with the reminder technique, e.g., vsknn_ebr. For the sr method, only boosting was applied, because extending the session is not applicable for this algorithm, which by design only considers the last interaction in a session. In principle, the Extend and Boost mechanisms can also be applied to the neural session-based methods; initial experiments with the Extend extension for gru4rec and narm, however, did not lead to performance gains. Note also that the adaptation of existing neural session-based methods is not the main focus of our work.
For the sake of clarity, for each model and dataset, we report (i) the combination of the extensions that led to the best performance according to MAP@20 and (ii) the results of the original models without extensions. In case the original model had better performance than the extended ones, we only report the results for the original model.
In our analysis, we focus on the performance comparison of non-neural methods and neural session-aware recommendation techniques. Therefore, in Table 3 to Table 6, the highest obtained values among algorithms of these two families are printed in bold. Moreover, we underline the highest value that is obtained by the other family of algorithms. Stars indicate significant differences (p<0.05) according to a Kruskal-Wallis test between all the models and a Wilcoxon signed-rank test between the best-performing techniques from either category (non-neural or neural session-aware recommendation methods).
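The two reported significance tests map directly onto SciPy. The sketch below uses our own names: `results` maps an algorithm name to its list of per-measurement accuracy values, and SciPy is assumed to be installed.

```python
from scipy import stats


def significance_tests(results, best_non_neural, best_neural):
    """Returns the p-values of a Kruskal-Wallis H-test across all
    models and of a Wilcoxon signed-rank test between the two named
    best-performing models (paired per-measurement values)."""
    _, p_all = stats.kruskal(*results.values())
    _, p_pair = stats.wilcoxon(results[best_non_neural],
                               results[best_neural])
    return p_all, p_pair
```

The Wilcoxon test is the appropriate paired test here because both models are evaluated on exactly the same test sessions.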

Accuracy Results
We can summarize the accuracy results for the individual datasets as follows. Probably even more surprising is that the session-aware methods are mostly not better than the purely session-based techniques. Among the session-aware recommendation methods, this time ncsf achieves the best accuracy results on all metrics, even though it was not originally evaluated on this dataset. Using alternative loss functions for hgru4rec might help to improve its performance; such modifications of the original algorithms are, however, not in the scope of our work.
Note that the reminders worked generally very well on this dataset. Reminders, for example, help to improve the performance of the neural session-based techniques (gru4rec, narm) to an extent that they sometimes outperform the basic nearest-neighbor techniques. However, when the extensions are also considered for the neighborhood-based techniques, their accuracy results are again much higher than those of the neural techniques.

COSMETICS.
We observe similar results also on the COSMETICS dataset.
Neighborhood-based methods lead to the best results for all accuracy metrics, and the extensions resulted in performance improvements in almost all cases.
Moreover, all methods except sr and narm outperform the session-aware techniques. Looking at the session-aware methods, we notice that ncsf has the best performance in all the accuracy metrics, except for the MRR, and hgru4rec achieves the best results for the Hit Rate (along with ncsf) and the MRR.
LASTFM. The picture for this dataset differs from the others in certain respects. First, while nearest-neighbor approaches again lead to the best results for Precision, Recall, MAP, and Hit Rate, these methods are outperformed on the MRR by all neural session-based and session-aware methods except shan. However, none of these methods, except iirnn, outperforms the simple sr method and its extended version. The best MRR results are obtained by iirnn, which is, however, one of the earliest session-aware methods. Note also that the results of iirnn are only minimally better than those of sr_b.
Nonetheless, further investigations are needed to understand the reasons why some methods perform very well on the MRR on this particular dataset.
A second difference to the other datasets is that, while applying the simple extensions Extend and Boost again proves to be beneficial, the reminder extension does not improve the performance of the original methods in some cases for this dataset. We therefore only report the results of the original neural session-based methods (gru4rec and narm).
Summary and Additional Observations. Table 7 summarizes the findings presented in Tables 3-6 by reporting the best-performing approaches on the individual datasets. In Table 7, we can see that vsknn, stan, and vstan with their extensions largely dominate the field across the datasets and measures, and we recommend that these methods be considered as additional baselines in future performance comparisons. Only in one case did a method from another category (i.e., iirnn, an early neural session-aware method) appear among the best-performing approaches.
Looking at the best models among the neural approaches, including both session-based and session-aware, in Table 8, we notice that narm_r and gru4rec_r outperform the session-aware models on the RETAIL, XING, and COSMETICS datasets. On the LASTFM dataset, where the reminder extension did not lead to any performance improvements for the neural session-based models, iirnn is the best one. As a side observation, we see that the ranking of the neural algorithms is often not correlated with the publication year of the methods, i.e., newer methods are not consistently better than older ones.

Coverage and Popularity
Tables 3-6 also report the values of coverage and popularity bias of the algorithms. We can make the following observations.
Coverage. shan consistently has the lowest coverage value across all datasets. In other words, it has the highest tendency to recommend the same small set of items to different users.
Popularity Bias. gru4rec and hgru4rec consistently have the lowest tendency to recommend popular items. In contrast, shan exhibits the highest popularity bias by a considerable margin. Moreover, iirnn and ncsf achieve higher values than all other methods across all datasets. The neighborhood-based approaches often lie in the middle; they therefore do not generally focus more on popular items than the neural approaches. Finally, we observe that using the proposed extensions in most cases leads to a higher popularity bias.

Scalability
For the sake of brevity, we only report the results for two datasets here and provide the results for the other datasets in the appendix. We made all the measurements on the validation slice (i.e., the largest slice in terms of the number of events) for each dataset. Table 9 and Table 10 show the results for LASTFM and XING, respectively. We selected these two datasets for the discussion because they are larger than the other two and differ in some key characteristics. The LASTFM dataset, for example, contains the largest number of events but a comparably small number of users. The XING dataset, on the other hand, contains events from a much larger number of users.
The algorithms in these tables are grouped by the type of approach (non-neural, session-based neural, session-aware neural). For the session-based algorithms (neural and non-neural), we report the running times both of the best-performing extension and of the corresponding basic method.
The overall results are similar across all datasets. In terms of training times, the non-neural models are consistently the fastest, because "training" for these models mainly involves the initialization of data structures (nearest-neighbor techniques) or the collection of simple count statistics (sr). Interestingly, strong differences can be observed between the neural session-based methods gru4rec and narm, with gru4rec being orders of magnitude faster than narm in the training phase. Generally, the use of the proposed extensions (Boost, Extend, Remind) leads to increased computation times for training and predicting.
Note, however, that the more efficient basic versions of the nearest-neighbor techniques already outperform all neural session-aware methods in all cases except one. (Remember that all nearest-neighbor methods, both the basic and the extended versions, outperformed the iirnn method on all accuracy metrics except the MRR.)
Overall, as mentioned above, we observe quite a spread in the running times among the neural methods. A detailed theoretical analysis of the computational complexity of the underlying network architectures and of individual architecture elements is, however, beyond the scope of our present work, which focuses on the empirical evaluation of the various algorithms in terms of their prediction accuracy.

Implications and Guidelines
Different factors may have contributed to the surprising observations made in this paper. First, we can assume that session-based algorithms, which are often used as baselines to benchmark session-aware ones, have improved over time. This might for example be the case for the hgru4rec method, which most probably used an earlier version of gru4rec both as a session-based baseline and as a building block for the newly proposed method.
Another potential reason may, however, also lie in methodological issues and problematic research practices that were previously observed not only for more traditional top-n recommendation tasks, but also for other areas of applied machine learning such as information retrieval or time-series forecasting [3,7,30,49]. According to several of these works, one main problem seems to be that researchers often invest substantial effort to refine and fine-tune their newly proposed methods, but do not invest similar effort to optimize the baselines to which they compare them. This phenomenon can be seen as a form of confirmation bias, where researchers mainly seek evidence supporting their theories and claims, but do not appropriately consider contrary evidence or indications. In that context, reproducibility issues can also play a role; see also the discussion of reproducibility problems in AI and empirical computer science in general [9,6]. While researchers in recent years more frequently share the code of their newly proposed models publicly, they only very rarely share the code of the baselines or the code that was used to fine-tune all algorithms in the experimental evaluations.
Another related problem can lie in the choice of the baselines. In particular, in recent years we observe that researchers often only consider very recent baselines in their evaluations or limit themselves to neural methods [7]. This leads to the effect that long-known or simpler methods are no longer considered in the comparisons. In line with the findings of our study, we therefore recommend to (i) include well-tuned, conceptually simple methods as baselines, (ii) optimize all baseline models in a systematic and automated way, and (iii) carefully select and document the hyperparameter ranges and the optimization strategy.

Research Limitations and Future Work
In our work so far, we performed experiments for a variety of domains, including e-commerce, music, and job recommendation (social media), using publicly available and commonly used datasets. Nonetheless, further experiments with additional datasets are needed and are part of our future work.
Such experiments will help us ensure that our findings generalize to other domains and that they are not tied to specific dataset characteristics. Our experiments so far, however, indicate that the ranking of the algorithm families is quite consistent across datasets, with today's neural approaches to session-aware recommendation not being among the top-performing techniques, and with non-neural techniques working well across datasets.
Nonetheless, an interesting area for future work lies in an improved understanding in which ways dataset characteristics impact the performance and ranking of different algorithms, as was done for more traditional recommendation scenarios in [2]. Moreover, it would be interesting to analyze how sensitive the different algorithms are with respect to slight changes, including both the characteristics of the input data and hyperparameters. In practical environments, in which datasets continuously evolve due to new items and users, algorithms that are more robust with respect to these aspects, might be preferable. Finally, regarding running times-in particular for datasets where there is a larger set of past sessions per user-further performance enhancements of the nearest-neighbor methods might be achieved through additional engineering efforts.
Regarding the set of session-aware algorithms that were benchmarked in our work, we expect that a constant stream of new proposals will be published in the future. As such, our experimental evaluation can only represent a snapshot of the state-of-the-art in the area. Our current snapshot is furthermore limited to works where the source code was publicly available, as was done in [7]. An additional constraint that we applied when selecting baselines was that the code had to be written in Python. In our literature search, we only identified two related works that were not written in Python, [4] and [16]. The method proposed in [4] was written in Java, but no source code was provided. The work presented in [16] was written in MATLAB but is only marginally relevant for our work, as it (i) focuses on diversity aspects and (ii) provides a performance comparison only with session-based or sequential approaches, whereas our work focuses on session-aware techniques.
Finally, we observed that the datasets used in the experiments, both in the original papers and in our work, are small compared to the amounts of data that are available in real-world deployments. The current limitations of the investigated deep learning models might therefore be a result of these dataset limitations, see also [21]. With more data, and in particular with data spanning longer periods of time, neural techniques might become favorable and better suited to combine long-term preference signals with short-term user intents than today's methods.

Summary
Our in-depth empirical investigation of five recent neural approaches to session-aware recommendation has revealed that these methods, contrary to the claims in the respective papers, are not effective at leveraging long-term preference information for improved recommendations. According to our experiments, these methods are almost consistently outperformed by methods that only consider the very last interactions of a user. Furthermore, our analyses showed that non-neural methods based on nearest neighbors can lead to better performance results than methods based on deep learning, as was previously also observed for session-based recommendation in [29].
We see the reasons for these unexpected phenomena in methodological issues that are not limited to session-aware recommendation scenarios, in particular in the choice of the baselines in experimental evaluations or the lack of proper tuning of the baselines. On a more positive note, our findings suggest that there are many opportunities for the development of better neural and non-neural methods for session-aware recommendation problems.
We in particular believe that it is promising to look at repeating patterns, seasonal effects, or trends in the data. Moreover, the incorporation of side information (e.g., category information about items) as well as contextual information should help to further improve the prediction performance of new algorithms.