An Annotated Video Dataset for Computing Video Memorability

Using a collection of publicly available links to short form video clips of an average of 6 seconds duration each, 1,275 users manually annotated each video multiple times to indicate both long-term and short-term memorability of the videos. The annotations were gathered as part of an online memory game and measured a participant's ability to recall having seen the video previously when shown a collection of videos. The recognition tasks were performed on videos seen within the previous few minutes for short-term memorability and within the previous 24 to 72 hours for long-term memorability. Data includes the reaction times for each recognition of each video. Associated with each video are text descriptions (captions) as well as a collection of image-level features applied to 3 frames extracted from each video (start, middle and end). Video-level features are also provided. The dataset was used in the Video Memorability task as part of the MediaEval benchmark in 2020.


Keywords: Video memorability, Machine learning, Human memory, MediaEval Benchmark

Subject
Computer Vision and Pattern Recognition

Specific subject area
Ground truth data (videos, video features plus annotations) needed to build and train systems for the automatic computation of the memorability of short video clips

Type of data
Text files (csv)

How data were acquired
Raw videos are already publicly available online. Low-level features were extracted automatically from the videos, and annotation data were collected through crowdsourcing using a video memorability game with the participation of both volunteers and paid workers on Amazon Mechanical Turk.

Data format
Raw
Analyzed

Parameters for data collection
• Maximum false alarm rate (short-term): 30%
• Maximum false alarm rate (long-term): 40%
• Minimum recognition rate of vigilance fillers (short-term): 70%
• Minimum recognition rate (long-term): 15%
• The false alarm rate must be lower than the recognition rate (long-term).

Description of data collection
1,500 short videos selected from the Vimeo Creative Commons (V3C1) dataset and used in the TRECVid 2019 Video-to-Text task were divided into three non-overlapping subsets: training, development, and testing. Multiple manual memorability annotations for each video were collected via a video memorability game, which displays a series of short videos and requires users to press the spacebar when they recall having seen a video before. The game consists of two parts. In the first part, videos are repeated within a few minutes and the user's interaction with each repeated video was collected to calculate short-term memorability scores. The second part took place between 24 and 72 hours after the initial viewing, and this time the participants' responses to previously seen videos from the first part were collected to obtain long-term memorability scores. After analysing the collected annotations, the short-term and long-term memorability scores of each video were each calculated as the percentage of correct recognitions of that video. Each video memorability annotation is accompanied by the video timepoint offsets at which it was recalled by users, the response times of the users, the key pressed when watching each video, and textual captions describing each video from the TRECVid benchmark. The Media Memorability 2020 dataset is included here with memorability annotations on 590 videos as part of the training set and 410 additional videos as part of the development set. We provide memorability annotations for the training and development set videos but not for the test set, because the test set is used in future MediaEval memorability benchmark tasks.

Data source location
Primary data sources: TRECVid 2019 Video-to-Text dataset [1], available from: https:

discover, find and retrieve digital content like video clips, and that means automatically analysing video so that it can be found. Much work in the computer vision community has concentrated on analysing video in terms of its content, identifying objects in the video or activities taking place in the video, but video has other characteristics such as aesthetics, interestingness or memorability. Video memorability refers to how easy it is for a person to remember seeing a video. It is useful when a system must choose which of several competing videos to present to a user who is searching for video clips. Video memorability will also be useful in areas like online advertising or video production, where the memorability of a video clip is important. The data provided here can be used to train a machine learning system to automatically calculate the likely memorability of a short form video clip.
• Researchers will find this data interesting if they work in the areas of human perception and scene understanding, such as image and video interestingness, memorability, attractiveness, aesthetics prediction, event detection, multimedia affect and perceptual analysis, multimedia content analysis, or machine learning.
• The dataset provides links to publicly available short form video clips averaging 6 seconds in duration, features which describe those videos, and annotations of the memorability of those videos. This is all the data needed to train a machine learning classifier to predict video memorability and to evaluate its accuracy.
• A huge amount of video material is now at our fingertips, from video sharing platforms like YouTube and Vimeo, video streaming platforms like Netflix and Amazon Prime, videos shared on social media platforms and even the clips we generate ourselves on our smartphones. Unlike searching text documents on the WWW, searching all this video content for a clip you have seen previously, or for a clip you think might exist, is not currently supported. Eventually technology companies will catch up with the growth in available video content and, as they do, the intrinsic memorability of a video clip will become an important characteristic in deciding whether to retrieve it for a user. Video search results would then rank the better, more memorable videos more highly.
• The specific use cases of creating video commercials or educational content require videos which people will remember. Because the impact of different forms of visual multimedia content (images or videos) on human memory is unequal, the capability of predicting or computing the likely memorability of a video clip is of high importance for professionals in the fields of advertising and education.
• Beyond advertising and educational applications, other areas such as filmmaking will find use for methods which calculate the memorability of video clips. We may see film and documentary makers shaping the key moments of a movie or documentary so as to maximise their likely memorability for the viewer, which opens up new ways of creating video material.

Data Description
The Media Memorability 2020 dataset contains a subset of short videos selected from the TRECVid 2019 Video-to-Text dataset [1], and a sample of frames from some of these is shown in Figure 1. We collected a minimum of 14 and a mean of 22 annotations in the short-term memorability step, and a minimum of 3 and a mean of 7 annotations in the long-term memorability step for

The dataset also contains the same features and structure in the training and development sets.
Table 3 presents the text files with the features and their descriptions.
• AlexNetFC7 (image-level feature) [4]
• HOG (image-level feature) [5]
• HSVHist (image-level feature)
• RGBHist (image-level feature)
• LBP (image-level feature) [6]
• VGGFC7 (image-level feature) [7]
• C3D (video-level feature) [8]

For image-level features we extract features from 3 frames for each video, each one in an individual file, where the filenames are composed as follows: <video id>-<frame no>.csv. The 3 frames per video represent the first, the middle and the last frames in the video clip. For example, for video id 8 we extract the following AlexNetFC7 feature files:
• AlexNetFC7/00008-000.csv: AlexNetFC7 feature for video id = 8, frame no = 0 (first frame)
• AlexNetFC7/00008-098.csv: AlexNetFC7 feature for video id = 8, frame no = 98 (middle frame)
• AlexNetFC7/00008-195.csv: AlexNetFC7 feature for video id = 8, frame no = 195 (last frame)

For video-level features we extract 1 feature for each video, where the filenames are composed as follows: <video id>.mp4.csv. Using the same video id 8 as an example, we extract the following C3D feature file:
• C3D/00008.mp4.csv: C3D features for video id = 8

Figure 3 shows the minimum and maximum reaction times for the short-term memorability annotations for each of the 590 videos in the training set, while Figure 4 shows the same for long-term memorability. The figures read from left to right, with each column being the vertical continuation of the preceding column. Videos are sorted from the greatest to the smallest difference between minimum and maximum reaction time; the x-axis is the reaction time in milliseconds and the numbers on the y-axes are video ids. The figures illustrate a large range of min-to-max reaction times: videos appearing later in the graph (rightmost column, towards the bottom) appear to be universally memorable to all annotators, while those at the other end of the graph are memorable to some as soon as video playback commences and less memorable to others. The positioning of the blue dots in the graph indicates that all videos have at least some annotators who remember the video early during playback, in many instances almost as soon as video playback commences. The differences between short- and long-term memorability annotations indicate that long-term recall happens sooner, i.e., earlier during video playback.
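The file-naming convention described above for the pre-computed features can be sketched in Python as follows. This is a minimal illustration, not code shipped with the dataset: the function names are hypothetical, and the zero-padding widths (5 digits for the video id, 3 digits for the frame number) are an assumption inferred from the example filenames.

```python
from pathlib import PurePosixPath

# Folder names taken from the feature list in the text.
IMAGE_FEATURES = ("AlexNetFC7", "HOG", "HSVHist", "RGBHist", "LBP", "VGGFC7")
VIDEO_FEATURES = ("C3D",)


def image_feature_path(feature: str, video_id: int, frame_no: int) -> str:
    """Relative path of an image-level feature file: <video id>-<frame no>.csv.

    Assumes 5-digit video ids and 3-digit frame numbers, as in the examples.
    """
    if feature not in IMAGE_FEATURES:
        raise ValueError(f"unknown image-level feature: {feature}")
    return str(PurePosixPath(feature) / f"{video_id:05d}-{frame_no:03d}.csv")


def video_feature_path(feature: str, video_id: int) -> str:
    """Relative path of a video-level feature file: <video id>.mp4.csv."""
    if feature not in VIDEO_FEATURES:
        raise ValueError(f"unknown video-level feature: {feature}")
    return str(PurePosixPath(feature) / f"{video_id:05d}.mp4.csv")


if __name__ == "__main__":
    print(image_feature_path("AlexNetFC7", 8, 98))
    print(video_feature_path("C3D", 8))
```

The frame numbers of the middle and last frames differ from video to video, so in practice they would be read from the directory listing rather than computed.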

Experimental Design, Materials and Methods
Each video has two associated memorability scores, which refer to its probability of being remembered after two different durations of memory retention. Memorability has been measured using recognition tests, i.e., through an objective measure, a few minutes after the memorisation of the videos (short-term), and then 24 to 72 hours later (long-term).
The ground truth dataset was collected using a video memorability game protocol proposed by Cohendet et al. [2]. In a first step (short-term memorisation), participants watched 180 videos, among which 40 target videos are repeated after a few minutes to collect short-term memorability

Both short-term and long-term memorability scores are calculated as the percentage of correct recognitions for each video by the participants.
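As a small worked illustration of the scoring rule stated above (the percentage of correct recognitions per video), the following Python sketch computes a score from a list of per-participant outcomes. The function name and the boolean input representation are assumptions made for illustration; the dataset's own annotation files may encode responses differently.

```python
from typing import Sequence


def memorability_score(recognised: Sequence[bool]) -> float:
    """Percentage of correct recognitions for one video.

    `recognised` holds one entry per annotation of the repeated video:
    True if the participant correctly indicated having seen it before.
    """
    if len(recognised) == 0:
        raise ValueError("at least one annotation is required")
    return 100.0 * sum(recognised) / len(recognised)


if __name__ == "__main__":
    # Hypothetical example: 3 of 4 participants recognised the video.
    print(memorability_score([True, True, False, True]))
```

The same function applies to both steps; only the set of annotations passed in (short-term or long-term) changes.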
The experimental protocol was written in PHP and JavaScript (a modified version of the JavaScript library in [9] was used) and interacts with a MySQL database. The interaction with Amazon Mechanical Turk was performed through JavaScript code. The optimisation problem for generating positions was written in MATLAB. Each participant could take part in the study only once. The order of videos was randomly assigned, using an algorithm that randomly selects from among the 1,000 least annotated videos and generates random repeat positions at a spacing of 45 to 100 videos (i.e., 4 to 9 minutes).
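The selection step described above, drawing at random from the least annotated videos, can be sketched as follows. This is only an illustrative Python sketch: the original protocol was implemented in PHP, JavaScript and MATLAB, and the function name, the tie-breaking by video id and the data structure are assumptions. The position-generating optimisation is not reproduced here.

```python
import random
from typing import Dict, List, Optional


def pick_videos(
    annotation_counts: Dict[int, int],
    n_videos: int,
    pool_size: int = 1000,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Randomly pick videos from the `pool_size` least annotated ones.

    `annotation_counts` maps video id -> number of annotations so far.
    Ties are broken by video id (an assumption, for determinism).
    """
    rng = rng or random.Random()
    ranked = sorted(annotation_counts, key=lambda vid: (annotation_counts[vid], vid))
    pool = ranked[:pool_size]
    # random.Random.sample draws without replacement.
    return rng.sample(pool, min(n_videos, len(pool)))


if __name__ == "__main__":
    counts = {vid: vid % 7 for vid in range(1500)}  # hypothetical counts
    print(pick_videos(counts, 5, rng=random.Random(0)))
```

Favouring the least annotated videos keeps the number of annotations per video roughly balanced as more participants play the game.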
Several vigilance controls were set up based on the results of an in-lab test, and only participants who met the controls were retained for the analysis:
1. 20 vigilance fillers were added in the short-term step, and we expected a recognition rate of at least 70% for those fillers.
2. a minimal recognition rate of 15% in the long-term step.
3. a maximal false alarm rate of 30% for short-term and 40% for long-term.
4. a false alarm rate lower than the recognition rate for long-term.
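The four controls listed above can be expressed as a simple participant filter. The Python sketch below is illustrative only: the class and field names are hypothetical, rates are assumed to be fractions in [0, 1], and treating the 70%, 15%, 30% and 40% thresholds as inclusive bounds is an assumption the text does not settle.

```python
from dataclasses import dataclass


@dataclass
class ParticipantStats:
    filler_recognition_rate: float      # short-term, on the 20 vigilance fillers
    short_term_false_alarm_rate: float
    long_term_recognition_rate: float
    long_term_false_alarm_rate: float


def passes_short_term(p: ParticipantStats) -> bool:
    """Controls 1 and 3 (short-term part)."""
    return (
        p.filler_recognition_rate >= 0.70
        and p.short_term_false_alarm_rate <= 0.30
    )


def passes_long_term(p: ParticipantStats) -> bool:
    """Controls 2, 3 (long-term part) and 4."""
    return (
        p.long_term_recognition_rate >= 0.15
        and p.long_term_false_alarm_rate <= 0.40
        and p.long_term_false_alarm_rate < p.long_term_recognition_rate
    )


if __name__ == "__main__":
    p = ParticipantStats(0.90, 0.10, 0.50, 0.20)  # hypothetical participant
    print(passes_short_term(p), passes_long_term(p))
```

Keeping the short-term and long-term checks separate reflects that fewer participants took part in the long-term step than in the short-term step.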
Two versions of the memorability game, using three language options (English, Spanish and Turkish), were published for different audiences and in different contexts. One was published on Amazon Mechanical Turk (AMT) and another was issued for general use among an audience essentially made up of students. A total of 1,275 different users participated in the short-term memorability step while 602 participated in the long-term memorability step. Only about 48%

The dataset contains links to, as well as features describing and annotations on, 590 videos as part of the training set and 410 videos as part of the development set. It also contains links to, and features describing, 500 videos used as test videos for the MediaEval Video Memorability benchmark in 2020.

Fig. 1. A sample of frames from some of the videos in the TRECVid 2019 Video-to-Text dataset.

Fig. 2. The number of annotations in the training, development, and test sets.

Fig. 3. Comparison of the short-term minimum and maximum reaction times for 510 videos.

Fig. 4. Comparison of the long-term minimum and maximum reaction times for 510 videos.

supported by Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 P2, co-funded by the European Regional Development Fund. The work of Rukiye Savran Kızıltepe is partially funded by the Turkish Ministry of National Education. Funding for the annotation of videos was provided through an award from NIST No. 60NANB19D155. We thank Cohendet et al. [2] for sharing their source code.

Table 2. Text files in the training and development sets of the MediaEval2020 Predicting Media Memorability dataset.

Table 3. The text files in the training and development sets of the MediaEval2020 Predicting Media Memorability dataset.

Additional pre-computed features are provided in individual folders per feature type and in individual csv files per sample, which are available in the data repository. There are seven folders containing the seven features for each of the training, development and test sets as follows: