Evaluation of session-based recommendation algorithms

User Modeling and User-Adapted Interaction

Abstract

Recommender systems help users find relevant items of interest, for example on e-commerce or media streaming sites. Most academic research is concerned with approaches that personalize the recommendations according to long-term user profiles. In many real-world applications, however, such long-term profiles do not exist, and recommendations therefore have to be made solely based on the observed behavior of a user during an ongoing session. Given the high practical relevance of this problem, interest in it has increased in recent years, leading to a number of proposals for session-based recommendation algorithms that typically aim to predict the user’s immediate next actions. In this work, we present the results of an in-depth performance comparison of a number of such algorithms, using a variety of datasets and evaluation measures. Our comparison includes the most recent approaches based on recurrent neural networks like gru4rec, factorized Markov model approaches such as fism or fossil, as well as simpler methods based, e.g., on nearest neighbor schemes. Our experiments reveal that algorithms of this latter class, despite their sometimes almost trivial nature, often perform as well as or significantly better than today’s more complex approaches based on deep neural networks. Our results therefore suggest that there is substantial room for improvement regarding the development of more sophisticated session-based recommendation algorithms.

Fig. 1 Adapted from Hidasi et al. (2016a)
Fig. 2 Adapted from Hidasi et al. (2016a)
Fig. 3 Adapted from Rendle et al. (2009)
Fig. 4
Fig. 5
Fig. 6
Fig. 7

Notes

  1. https://www.dropbox.com/sh/dbzmtq4zhzbj5o9/AACldzQWbw-igKjcPTBI6ZPAa?dl=0.

  2. Other weighting functions, e.g., with a logarithmic decay, are possible as well. Using the linear function, however, led to the best results on average in our experiments.

  3. The method was proposed by Hidasi et al. in the context of the gru4rec method.

  4. We use the implementation published at https://github.com/hidasib/GRU4Rec.

  5. We ran additional experiments using other ways of encoding sequential information, e.g., by using embeddings of sessions and items with the popular Word2Vec and Doc2Vec approaches. However, none of these variations led to better accuracy results than the sknn method in our experiments. We therefore omit these results from our later discussions.

  6. Note that the weighting function is designed to work independently from the similarity function. We rely on the binary session representation for the similarity calculation without considering the order of the items to ensure computational efficiency.

  7. https://www.dropbox.com/sh/dbzmtq4zhzbj5o9/AACldzQWbw-igKjcPTBI6ZPAa?dl=0.

  8. To ensure that the smaller size of those splits does not negatively affect the performance of the model-based approaches, we also tested the single-split configurations on all datasets. The obtained results are mostly in line with those obtained with the sliding-window protocol and are shown in “Appendix D”.

  9. http://www.clef-newsreel.org/.

  10. https://www.sport1.de/.

  11. We provide additional results that were obtained for measurements taken at multiple list lengths in “Appendix B”.

  12. In the dataset, timestamps are only available at the granularity of days.

  13. We applied the Wilcoxon signed-rank test (\(\alpha =0.05\)) to determine the significance of differences between the two best performing approaches for each dataset.

  14. The other media datasets did not exhibit any notable particularities.
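
Note 6 states that session similarity is computed on a binary session representation that ignores item order, and that the weighting function works independently of that similarity. The following Python sketch illustrates one way such a scheme could look. The cosine measure, the concrete form of `linear_weight`, the scoring rule, and all function names are assumptions made for illustration and are not taken from the published implementation.

```python
from collections import defaultdict
from math import sqrt


def binary_cosine(session_a, session_b):
    """Cosine similarity of two sessions seen as binary item vectors (order ignored)."""
    a, b = set(session_a), set(session_b)
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def linear_weight(position, length):
    """Hypothetical linear decay: the most recent item (position == length) gets 1.0."""
    return position / length


def recommend(current, past_sessions, k=2, top_n=3):
    """Score unseen items from the k past sessions most similar to the current one."""
    ranked = sorted(past_sessions, key=lambda s: binary_cosine(current, s), reverse=True)
    scores = defaultdict(float)
    for neighbor in ranked[:k]:
        sim = binary_cosine(current, neighbor)
        # 1-based positions of current-session items that also occur in the neighbor
        shared = [pos for pos, item in enumerate(current, start=1) if item in neighbor]
        if not shared:
            continue
        # Assumption: the neighbor is weighted by the most recent shared item
        weight = linear_weight(max(shared), len(current))
        for item in neighbor:
            if item not in current:
                scores[item] += sim * weight
    # Sort by descending score, break ties by item name for determinism
    return sorted(scores, key=lambda item: (-scores[item], str(item)))[:top_n]


past = [["a", "b", "c"], ["b", "d"], ["x", "y"]]
print(recommend(["a", "b"], past))
```

Because the weight is a separate factor, a logarithmic decay as mentioned in Note 2 could be swapped in by replacing `linear_weight` without touching `binary_cosine`.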

References

  • Adomavicius, G., Kwon, Y.O.: Improving aggregate recommendation diversity using ranking-based techniques. IEEE Trans. Knowl. Data Eng. 24(5), 896–911 (2012)

  • Adomavicius, G., Zhang, J.: Impact of data characteristics on recommender systems performance. ACM Trans. Manag. Inf. Syst. 3(1), 3:1–3:17 (2012)

  • Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in large databases. In: SIGMOD ’93, pp. 207–216 (1993)

  • Baeza-Yates, R., Jiang, D., Silvestri, F., Harrison, B.: Predicting the next app that you are going to use. In: WSDM ’15, pp. 285–294 (2015)

  • Billsus, D., Pazzani, M.J., Chen, J.: A learning agent for wireless news access. In: IUI ’00, pp. 33–36 (2000)

  • Bonnin, G., Jannach, D.: Automated generation of music playlists: survey and experiments. ACM Comput. Surv. 47(2), 26:1–26:35 (2014)

  • Chen, S., Moore, J.L., Turnbull, D., Joachims, T.: Playlist prediction via metric embedding. In: KDD ’12, pp. 714–722 (2012)

  • Chen, S., Xu, J., Joachims, T.: Multi-space probabilistic sequence modeling. In: KDD ’13, pp. 865–873 (2013)

  • Cheng, C., Yang, H., Lyu, M.R., King, I.: Where you like to go next: Successive point-of-interest recommendation. In: IJCAI ’13, pp. 2605–2611 (2013)

  • Cho, K., van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: EMNLP ’14, pp. 1724–1734 (2014)

  • Cooley, R., Mobasher, B., Srivastava, J.: Data preparation for mining world wide web browsing patterns. Knowl. Inf. Syst. 1(1), 5–32 (1999)

  • Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: WWW ’07, pp. 271–280 (2007)

  • Davidson, J., Liebald, B., Liu, J., Nandy, P., Van Vleet, T., Gargi, U., Gupta, S., He, Y., Lambert, M., Livingston, B., Sampath, D.: The YouTube video recommendation system. In: RecSys ’10, pp. 293–296 (2010)

  • Devooght, R., Bersini, H.: Long and short-term recommendations with recurrent neural networks. In: UMAP ’17, pp. 13–21 (2017)

  • Djuric, N., Radosavljevic, V., Grbovic, M., Bhamidipati, N.: Hidden conditional random fields with deep user embeddings for ad targeting. In: ICDM ’14, pp. 779–784 (2014)

  • Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M., Song, L.: Recurrent marked temporal point processes: embedding event history to vector. In: KDD ’16, pp. 1555–1564 (2016)

  • Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  • Feng, S., Li, X., Zeng, Y., Cong, G., Chee, Y.M., Yuan, Q.: Personalized ranking metric embedding for next new POI recommendation. In: IJCAI ’15, pp. 2069–2075 (2015)

  • Garcin, F., Dimitrakakis, C., Faltings, B.: Personalized news recommendation with context trees. In: RecSys ’13, pp. 105–112 (2013)

  • Grbovic, M., Radosavljevic, V., Djuric, N., Bhamidipati, N., Savla, J., Bhagwan, V., Sharp, D.: E-commerce in your inbox: product recommendations at scale. In: KDD ’15, pp. 1809–1818 (2015)

  • Hariri, N., Mobasher, B., Burke, R.: Context-aware music recommendation based on latent topic sequential patterns. In: RecSys ’12, pp. 131–138 (2012)

  • He, R., McAuley, J.: Fusing similarity models with Markov Chains for sparse sequential recommendation. CoRR (2016). arXiv:1609.09152

  • He, Q., Jiang, D., Liao, Z., Hoi, S.C.H., Chang, K., Lim, E.-P., Li, H.: Web query recommendation via sequential query prediction. In: ICDE ’09, pp. 1443–1454 (2009)

  • He, J., Li, X., Liao, L., Song, D., Cheung, W.: Inferring a personalized next point-of-interest recommendation model with latent behavior patterns. In: AAAI ’16 (2016)

  • Hidasi, B., Karatzoglou, A.: Recurrent neural networks with top-k gains for session-based recommendations. CoRR (2017). arXiv:1706.03847

  • Hidasi, B., Karatzoglou, A., Baltrunas, L., Tikk, D.: Session-based recommendations with recurrent neural networks. In: ICLR ’16 (2016a)

  • Hidasi, B., Quadrana, M., Karatzoglou, A., Tikk, D.: Parallel recurrent neural network architectures for feature-rich session-based recommendations. In: RecSys ’16, pp. 241–248 (2016b)

  • Hosseinzadeh Aghdam, M., Hariri, N., Mobasher, B., Burke, R.: Adapting recommendations to contextual changes using hierarchical hidden Markov models. In: RecSys ’15, pp. 241–244 (2015)

  • Jannach, D., Hegelich, K.: A case study on the effectiveness of recommendations in the mobile internet. In: RecSys ’09, pp. 205–208 (2009)

  • Jannach, D., Adomavicius, G.: Recommendations with a purpose. In: RecSys ’16, pp. 7–10 (2016)

  • Jannach, D., Ludewig, M.: When recurrent neural networks meet the neighborhood for session-based recommendation. In: RecSys ’17, pp. 306–310 (2017)

  • Jannach, D., Lerche, L., Jugovac, M.: Adaptation and evaluation of recommendations for short-term shopping goals. In: RecSys ’15, pp. 211–218 (2015a)

  • Jannach, D., Lerche, L., Kamehkhosh, I., Jugovac, M.: What recommenders recommend: an analysis of recommendation biases and possible countermeasures. User Model. User Adapt. Interact. 25(5), 427–491 (2015b)

  • Jannach, D., Kamehkhosh, I., Lerche, L.: Leveraging multi-dimensional user models for personalized next-track music recommendation. In: ACM SAC 2017 (2017a)

  • Jannach, D., Ludewig, M., Lerche, L.: Session-based item recommendation in e-commerce: on short-term intents, reminders, trends, and discounts. User Model. User Adapt. Interact. 27(3–5), 351–392 (2017b)

  • Jugovac, M., Jannach, D., Karimi, M.: StreamingRec: a framework for benchmarking stream-based news recommenders. In: RecSys 2018 (2018)

  • Kabbur, S., Ning, X., Karypis, G.: FISM: factored item similarity models for top-n recommender systems. In: KDD ’13, pp. 659–667 (2013)

  • Kamehkhosh, I., Jannach, D., Ludewig, M.: A comparison of frequent pattern techniques and a deep learning method for session-based recommendation. In: TempRec Workshop at ACM RecSys ’17, Como, Italy (2017)

  • Karimi, M., Jannach, D., Jugovac, M.: News recommender systems—survey and roads ahead. Inf. Process. Manag. 54(6), 1203–1227 (2018)

  • Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR (2014). arXiv:1412.6980

  • Lee, D., Hosanagar, K.: Impact of recommender systems on sales volume and diversity. In: ICIS 2014 (2014)

  • Lerche, L., Jannach, D., Ludewig, M.: On the value of reminders within e-commerce recommendations. In: UMAP ’16, pp. 27–35 (2016)

  • Li, Z., Zhao, H., Liu, Q., Huang, Z., Mei, T., Chen, E.: Learning from history and present: next-item recommendation via discriminatively exploiting user behaviors. In: KDD 2018 (2018)

  • Lian, D., Zheng, V.W., Xie, X.: Collaborative filtering meets next check-in location prediction. In: WWW ’13, pp. 231–232 (2013)

  • Linden, G., Smith, B., York, J.: Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Comput. 7(1), 76–80 (2003)

  • Liu, J., Dolan, P., Pedersen, E.R.: Personalized news recommendation based on click behavior. In: IUI ’10, pp. 31–40 (2010)

  • Liu, Y., Liu, C., Liu, B., Qu, M., Xiong, H.: Unified point-of-interest recommendation with temporal interval assessment. In: KDD ’16, pp. 1015–1024 (2016)

  • Liu, Q., Zeng, Y., Mokhosi, R., Zhang, H.: STAMP: short-term attention/memory priority model for session-based recommendation. In: KDD 2018 (2018)

  • Ludmann, C.A.: Recommending news articles in the CLEF news recommendation evaluation lab with the data stream management system odysseus. In: Working Notes of CLEF 2017—Conference and Labs of the Evaluation (2017)

  • McFee, B., Lanckriet, G.: The natural language of playlists. In: ISMIR ’11, pp. 537–541 (2011)

  • McFee, B., Lanckriet, G.R.G.: Hypergraph models of playlist dialects. In: ISMIR ’12, pp. 343–348 (2012)

  • Mobasher, B., Dai, H., Luo, T., Nakagawa, M.: Using sequential and non-sequential patterns in predictive web usage mining tasks. In: ICDM ’02, pp. 669–672 (2002)

  • Moling, O., Baltrunas, L., Ricci, F.: Optimal radio channel recommendations with explicit and implicit feedback. In: RecSys ’12, pp. 75–82 (2012)

  • Natarajan, N., Shin, D., Dhillon, I.S.: Which app will you use next? Collaborative filtering with interactional context. In: RecSys ’13, pp. 201–208 (2013)

  • Norris, J.R.: Markov Chains. Cambridge University Press, Cambridge (1997)

  • Quadrana, M., Karatzoglou, A., Hidasi, B., Cremonesi, P.: Personalizing session-based recommendations with hierarchical recurrent neural networks. In: RecSys ’17 (2017)

  • Quadrana, M., Cremonesi, P., Jannach, D.: Sequence-aware recommender systems. ACM Comput. Surv. 51(4), 66:1–66:36 (2018)

  • Reddy, S., Labutov, I., Joachims, T.: Learning student and content embeddings for personalized lesson sequence recommendation. In: ACM Learning @ Scale ’16, pp. 93–96 (2016)

  • Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: BPR: Bayesian personalized ranking from implicit feedback. In: UAI ’09, pp. 452–461 (2009)

  • Rendle, S., Freudenthaler, C., Schmidt-Thieme, L.: Factorizing personalized Markov Chains for next-basket recommendation. In: WWW ’10, pp. 811–820 (2010)

  • Shani, G., Heckerman, D., Brafman, R.I.: An MDP-based recommender system. J. Mach. Learn. Res. 6, 1265–1295 (2005)

  • Soh, H., Sanner, S., White, M., Jamieson, G.: Deep sequential recommendation for personalized adaptive user interfaces. In: IUI ’17, pp. 589–593 (2017)

  • Song, Q., Cheng, J., Yuan, T., Lu, H.: Personalized recommendation meets your next favorite. In: CIKM ’15, pp. 1775–1778 (2015)

  • Song, Y., Elkahky, A.M., He, X.: Multi-rate deep learning for temporal recommendation. In: SIGIR ’16, pp. 909–912 (2016)

  • Sordoni, A., Bengio, Y., Vahabi, H., Lioma, C., Grue Simonsen, J., Nie, J.-Y.: A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In: CIKM ’15, pp. 553–562 (2015)

  • Tagami, Y., Kobayashi, H., Ono, S., Tajima, A.: Modeling user activities on the web using paragraph vector. In: WWW ’15, pp. 125–126 (2015)

  • Tan, Y.K., Xu, X., Liu, Y.: Improved recurrent neural networks for session-based recommendations. In: DLRS ’16 Workshop at ACM RecSys, pp. 17–22 (2016)

  • Tavakol, M., Brefeld, U.: Factored MDPs for detecting topics of user sessions. In: RecSys ’14, pp. 33–40 (2014)

  • Turrin, R., Quadrana, M., Condorelli, A., Pagano, R., Cremonesi, P.: 30music listening and playlists dataset. In: Poster Proceedings of RecSys ’15 (2015)

  • Twardowski, B.: Modelling contextual information in session-aware recommender systems with neural networks. In: RecSys ’16, pp. 273–276 (2016)

  • Vasile, F., Smirnova, E., Conneau, A.: Meta-prod2vec: product embeddings using side-information for recommendation. In: RecSys ’16, pp. 225–232 (2016)

  • Verstrepen, K., Goethals, B.: Unifying nearest neighbors collaborative filtering. In: RecSys ’14, pp. 177–184 (2014)

  • Wu, X., Liu, Q., Chen, E., He, L., Lv, J., Cao, C., Hu, G.: Personalized next-song recommendation in online karaokes. In: RecSys ’13, pp. 137–140 (2013)

  • Yap, G.-E., Li, X.-L., Yu, P.S.: Effective next-items recommendation via personalized sequential pattern mining. In: DASFAA ’12, Volume Part II, pp. 48–64 (2012)

  • Yu, F., Liu, Q., Wu, S., Wang, L., Tan, T.: A dynamic recurrent model for next basket recommendation. In: SIGIR ’16, pp. 729–732 (2016)

  • Zangerle, E., Pichl, M., Gassler, W., Specht, G.: #nowplaying music dataset: extracting listening behavior from Twitter. In: WISMM ’14 Workshop at MM ’14, pp. 21–26 (2014)

  • Zeiler, M.D.: ADADELTA: an adaptive learning rate method. CoRR (2012). arXiv:1212.5701

  • Zhang, Y., Dai, H., Xu, C., Feng, J., Wang, T., Bian, J., Wang, B., Liu, T.-Y.: Sequential click prediction for sponsored search with recurrent neural networks. In: AAAI ’14, pp. 1369–1375 (2014)

  • Zheleva, E., Guiver, J., Mendes Rodrigues, E., Milić-Frayling, N.: Statistical models of music-listening sessions in social media. In: WWW ’10, pp. 1019–1028 (2010)

Author information

Corresponding author

Correspondence to Malte Ludewig.

Additional information

A preliminary comparison of sequential recommendation algorithms was presented in our own previous work in Jannach and Ludewig (2017) and Kamehkhosh et al. (2017). A pre-print version of this work is available at https://arxiv.org/abs/1803.09587.

Appendices

Parameter configurations

See Tables 9, 10, 11 and 12.

Table 9 Parameters used for the gru4rec algorithm for all datasets
Table 10 Parameters used for the smf algorithm for all datasets
Table 11 Parameters used for the v-sknn algorithm for all datasets
Table 12 Parameters used for the sknn, s-sknn, and sf-sknn algorithms for all datasets

Full result tables

See Tables 13, 14, 15, 16, 17, 18, 19, 20, 21 and 22.

Table 13 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the RSC15 dataset (sorted by MRR@20)
Table 14 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the RSC15-S dataset (sorted by MRR@20)
Table 15 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the TMALL dataset (sorted by MRR@20)
Table 16 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the RETAILROCKET dataset (sorted by MRR@20)
Table 17 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the ZALANDO dataset (sorted by MRR@20)
Table 18 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the 8TRACKS dataset (sorted by MRR@20)
Table 19 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the AOTM dataset (sorted by MRR@20)
Table 20 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the 30MUSIC dataset (sorted by MRR@20)
Table 21 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the NOWPLAYING dataset (sorted by MRR@20)
Table 22 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the CLEF dataset (sorted by MRR@20)
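
The captions above report hit rate, mean reciprocal rank, and item coverage at several list lengths. As a rough illustration of how such measures are commonly computed, the following Python sketch assumes that each test case consists of a ranked recommendation list and a single ground-truth next item. The function names and this evaluation setup are assumptions, not the evaluation code behind the tables. The average popularity (POP) measure is left out because its normalization is not described in this excerpt.

```python
def hit_rate_at_k(ranked_lists, next_items, k):
    """Fraction of test cases whose next item appears in the top-k list."""
    hits = sum(1 for ranked, target in zip(ranked_lists, next_items) if target in ranked[:k])
    return hits / len(next_items)


def mrr_at_k(ranked_lists, next_items, k):
    """Mean reciprocal rank of the next item; 0 if it is not within the top k."""
    total = 0.0
    for ranked, target in zip(ranked_lists, next_items):
        top_k = ranked[:k]
        if target in top_k:
            total += 1.0 / (top_k.index(target) + 1)
    return total / len(next_items)


def coverage_at_k(ranked_lists, catalog, k):
    """Share of catalog items that appear in at least one top-k list."""
    recommended = set()
    for ranked in ranked_lists:
        recommended.update(ranked[:k])
    return len(recommended & set(catalog)) / len(catalog)


# Toy data: three test cases, hypothetical item identifiers
ranked_lists = [["a", "b", "c"], ["d", "a", "b"], ["c", "d", "a"]]
next_items = ["b", "b", "x"]
for k in (3, 1):
    print(k, hit_rate_at_k(ranked_lists, next_items, k), mrr_at_k(ranked_lists, next_items, k))
```

Hit rate only checks membership in the top-k list, whereas MRR additionally rewards placing the next item near the top, which is why the two can rank algorithms differently.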

Additional results for precision and recall

See Tables 23, 24, 25, 26, 27, 28, 29 and 30.

Table 23 Precision (P) and recall (R) results for a list length of 20, 10, 5, and 3 on the TMALL dataset (sorted by P@20)
Table 24 Precision (P) and recall (R) results for a list length of 20, 10, 5, and 3 on the RETAILROCKET dataset (sorted by P@20)
Table 25 Precision (P) and recall (R) results for a list length of 20, 10, 5, and 3 on the ZALANDO dataset (sorted by P@20)
Table 26 Precision (P) and recall (R) results for a list length of 20, 10, 5, and 3 on the 8TRACKS dataset (sorted by P@20)
Table 27 Precision (P) and recall (R) results for a list length of 20, 10, 5, and 3 on the AOTM dataset (sorted by P@20)
Table 28 Precision (P) and recall (R) results for a list length of 20, 10, 5, and 3 on the 30MUSIC dataset (sorted by P@20)
Table 29 Precision (P) and recall (R) results for a list length of 20, 10, 5, and 3 on the NOWPLAYING dataset (sorted by P@20)
Table 30 Precision (P) and recall (R) results for a list length of 20, 10, 5, and 3 on the CLEF dataset (sorted by P@20)
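
Precision and recall at a given list length can be sketched in the same style. The Python example below assumes that the relevant set of a test case is a collection of items from the rest of the session; this choice of relevant set and the function name are assumptions for illustration only.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Return (precision@k, recall@k) for one ranked list and one relevant set."""
    relevant_set = set(relevant)
    hits = len(set(ranked[:k]) & relevant_set)
    precision = hits / k
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Toy example with hypothetical item identifiers
ranked = ["a", "b", "c", "d"]
relevant = ["b", "d", "e"]
print(precision_recall_at_k(ranked, relevant, 2))
print(precision_recall_at_k(ranked, relevant, 4))
```

Dividing by `k` rather than by the number of returned items penalizes lists that are shorter than the requested length, which is one of several possible conventions.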

Additional single split results

See Tables 31, 32, 33, 34, 35, 36, 37 and 38.

Table 31 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the TMALL dataset with a single split (sorted by MRR@20)
Table 32 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the RETAILROCKET dataset with a single split (sorted by MRR@20)
Table 33 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the ZALANDO dataset with a single split (sorted by MRR@20)
Table 34 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the 8TRACKS dataset with a single split (sorted by MRR@20)
Table 35 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the AOTM dataset with a single split (sorted by MRR@20)
Table 36 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the 30MUSIC dataset with a single split (sorted by MRR@20)
Table 37 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the NOWPLAYING dataset with a single split (sorted by MRR@20)
Table 38 Hit rate (HR), mean reciprocal rank (MRR), item coverage (COV), and average popularity (POP) results for a list length of 20, 10, 5, 3, and 1 on the CLEF dataset with a single split (sorted by MRR@20)

About this article

Cite this article

Ludewig, M., Jannach, D. Evaluation of session-based recommendation algorithms. User Model User-Adap Inter 28, 331–390 (2018). https://doi.org/10.1007/s11257-018-9209-6

  • DOI: https://doi.org/10.1007/s11257-018-9209-6
