
Visual Interestingness Prediction: A Benchmark Framework and Literature Review

International Journal of Computer Vision

Abstract

In this paper, we report on the creation of a publicly available, common evaluation framework for image and video visual interestingness prediction. We propose a robust data set, Interestingness10k, which comprises 9831 images and more than 4 h of video, interestingness scores derived from more than 1M pair-wise annotations by 800 trusted annotators, a set of pre-computed multi-modal descriptors, and 192 system output results that serve as baselines. The data were validated extensively during the 2016–2017 MediaEval benchmark campaigns. We provide an in-depth analysis of the crucial components of visual interestingness prediction algorithms by reviewing the capabilities and the evolution of the MediaEval benchmark systems, as well as of prominent systems from the literature. We discuss overall trends, the influence of the employed features and techniques, generalization capabilities, and the reliability of results. We also discuss the possibility of going beyond state-of-the-art performance via an automatic, ad-hoc system fusion, and propose a deep MLP-based architecture that outperforms the current state-of-the-art systems by a large margin. Finally, we provide the most important lessons learned and insights gained.
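As an illustration of how pair-wise annotations can be turned into per-item scores, the following is a minimal sketch that fits a Bradley-Terry model (BTL appears in the abbreviation list of this article) with the standard minorization-maximization update. It is not taken from the article: the function name `bradley_terry_scores`, the `(winner, loser)` input format, and the toy annotations are assumptions made for illustration, and the authors' actual procedure may differ.

```python
from collections import defaultdict


def bradley_terry_scores(comparisons, n_items, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Hypothetical helper for illustration only. Uses the classic
    minorization-maximization update and returns strengths that sum to 1.
    """
    wins = [0.0] * n_items
    pair_counts = defaultdict(float)  # unordered pair -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[(min(winner, loser), max(winner, loser))] += 1.0

    strengths = [1.0 / n_items] * n_items
    for _ in range(iterations):
        denominators = [0.0] * n_items
        for (i, j), count in pair_counts.items():
            shared = count / (strengths[i] + strengths[j])
            denominators[i] += shared
            denominators[j] += shared
        # Items that were never compared keep their previous strength.
        updated = [
            wins[i] / denominators[i] if denominators[i] > 0 else strengths[i]
            for i in range(n_items)
        ]
        total = sum(updated)
        strengths = [value / total for value in updated]
    return strengths


if __name__ == "__main__":
    # Toy annotations (assumed): each tuple means "first item judged more interesting".
    toy = [(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)] + [(0, 2)] * 3 + [(2, 0)]
    print(bradley_terry_scores(toy, n_items=3))
```

Sorting items by the returned strengths gives a ranking. A real annotation campaign would also need to handle items that never win a comparison, for example with a small prior, which this sketch omits.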



Notes

  1. https://www.technicolor.com/.

  2. http://www.multimediaeval.org/.

  3. http://host.robots.ox.ac.uk/pascal/VOC/.

  4. http://www.image-net.org/challenges/LSVRC/.

  5. https://trecvid.nist.gov/.

  6. https://www.imageclef.org/.

  7. https://www.mturk.com/.

  8. https://www.flickr.com/.

  9. https://www.youtube.com/.

  10. https://scholar.google.com/.

  11. The Interestingness10k data set is available for download here: https://www.interdigital.com/data_sets/interestingness-dataset.

  12. https://creativecommons.org/.

  13. A video shot is a sequence of images recorded continuously between a camera turn on and off.

  14. The web-based pair-wise annotation software tool is available here: https://github.com/mvsjober/pair-annotate.

  15. https://trec.nist.gov/trec_eval/.

  16. https://www.imdb.com/.

  17. https://www.flickr.com/services/api/flickr.interestingness.getList.html.

Abbreviations

API: Application programming interface

BN: Batch normalization

BTL: Bradley-Terry-Luce

CNN: Convolutional neural networks

C3D: Convolutional 3-dimensional

CSP-RNN: Circular state-passing recurrent neural network

DNN: Deep neural networks

GMM: Gaussian mixture models

HMM: Hidden Markov models

HoG: Histograms of oriented gradients

HMP: Histogram of motion patterns

HSV: Hue-saturation-value

kNN: k-nearest neighbours

LBP: Local binary patterns

LSTM: Long short-term memory

MLP: Multi-layer perceptron

MFCC: Mel-frequency cepstral coefficients

mAP: Mean average precision

NMMP: Neighborhood minmax projections

NN: Neural network

PCA: Principal component analysis

SIFT: Scale invariant feature transform

SVM: Support vector machines

SMR: Supervised manifold regression

VOD: Video on demand

VSEM: Visual-semantic embedding model


Acknowledgements

We would first like to acknowledge Technicolor France for founding and supporting the Interestingness10k data set and the Predicting Media Interestingness task. We acknowledge the work of our fellow task co-organizers (in alphabetical order): Alexey Ozerov, Frédéric Lefebvre, Hanli Wang, Michael Gygli, Toan Do, Vincent Demoulin, and Yu-Gang Jiang. We would also like to acknowledge the MediaEval Benchmarking Initiative for Multimedia Evaluation, and in particular Martha Larson, for hosting the Predicting Media Interestingness Task and for their constant support and enlightening discussions. The work of Mihai Gabriel Constantin and Bogdan Ionescu was supported by the Ministry of Innovation and Research, UEFISCDI, project SPIA-VA, via agreement 2SOL/2017, and by project AI4Media, A European Excellence Centre for Media, Society and Democracy, H2020 ICT-48-2020, grant #951911. The work of Liviu-Daniel Ştefan was supported by the Operational Programme Human Capital of the Ministry of Europe Funds through the Financial Agreement 51675/09.07.2019, SMIS code 125125.

Author information

Corresponding author

Correspondence to Mihai Gabriel Constantin.

Additional information

Communicated by Antonio Torralba.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Constantin, M.G., Ştefan, LD., Ionescu, B. et al. Visual Interestingness Prediction: A Benchmark Framework and Literature Review. Int J Comput Vis 129, 1526–1550 (2021). https://doi.org/10.1007/s11263-021-01443-1

  • DOI: https://doi.org/10.1007/s11263-021-01443-1
