Data2MV - A user behaviour dataset for multi-view scenarios

The Data2MV dataset contains gaze fixation data obtained through experimental procedures from a total of 45 participants, using an Intel RealSense F200 camera module and seven different video playlists. Each playlist had an approximate duration of 20 minutes and was viewed at least 17 times, with raw tracking data recorded at 0.05-second intervals. The Data2MV dataset encompasses a total of 1,000,845 gaze fixations, gathered across 128 experiments. It also comprises 68,393 image frames, extracted from each of the six videos selected for these experiments, and an equal number of saliency maps, generated from aggregate fixation data. Software tools to obtain saliency maps and generate complementary plots are also provided as an open-source software package. The Data2MV dataset was publicly released to the research community on Mendeley Data and constitutes an important contribution to reducing the current scarcity of such data, particularly for immersive, multi-view streaming scenarios.


Value of the Data
• The dataset is composed of 1,000,845 gaze fixations, gathered across 128 experiments, 7 video playlists, and 119 views. Raw and post-processed tracking data are provided as TXT and CSV files, respectively.
• 68,393 image frames (and corresponding saliency maps) are provided as JPG files.
• Data was gathered from a total of 45 participants viewing the same multi-view video content at different time instances, encompassing all possible gaze fixation variations during pre-established viewing periods. Outlier data (from initial set-up and calibration) was filtered out to discard potential fixation bias.
• The dataset contributes to the training, testing, and evaluation of visual attention models, where gaze fixation data is required for predicting user gaze behavior and adjusting content (and/or view) selection, preparation, and distribution.
• Data can be used to evaluate the accuracy of interactive multi-view streaming systems (e.g., selection of new viewpoints based on user viewing behavior data), their effect on system performance (e.g., view switching latency, segment download/buffering), and their overall impact on existing mechanisms (e.g., predictive view selection, multi-view content presentation).
• Due to the scarce availability of datasets targeted at multi-view scenarios, the dataset increases the limited pool of available options with a significant sample of high-resolution image frames, saliency maps, and tracking data (e.g., three-dimensional point clouds), and provides software tools for data management and visualization.

Objective
The most common option to obtain large sets of gaze fixation data is to use Head-Mounted Devices (HMDs) [1,2]. The availability of such datasets is of high importance to the research community investigating novel solutions for immersive video applications (e.g., 360-degree video streaming [3]). However, using HMDs is not trivial, and they are often not accessible to everyone. We have thus devised an alternative setup for collecting such data, eliminating the need for HMDs while providing the ability to create datasets useful for a wider range of applications, namely multi-view [4]. Data was collected using an RGB-D camera module (Intel RealSense F200 [5]) in order to capture the attention of the user on the screen when watching multi-view content. In addition to being a non-intrusive, cost-effective solution, the selected RGB-D camera module can be used for interactive tasks without resorting to additional body-centric equipment.

Data Description
The newly developed dataset described throughout this article encompasses several categories of data, each with its own specificities (e.g., video frames, saliency maps, log files). The following section describes which data can be found in the dataset and, in particular, how such data is organized for better access, comprehension, and handling.

Data overview
The dataset is composed of 68,393 image frames, extracted from 360-degree multi-view video content (Fig. 4), and saliency maps, generated from cumulative gaze tracking data collected with the RGB-D camera module (Fig. 5). A representative example of the process of acquiring gaze fixation data during content playback is shown in Figs. 2 and 3. Gaze tracking data was collected at 0.05-second intervals from a total of 45 individuals while viewing 7 video playlists composed of multiple perspectives from original 360-degree content (split into 6 individual views according to cubemap projection). The most significant characteristics of each of the selected videos are presented in Table 1. The gender distribution between participants was almost equal (24 males and 21 females), with every participant being of the same race (white). Ages ranged between 22 and 88 years old, with an average age of 33.33 years and a mode of 25.00 years; 28 of the 45 participants (62.2% of the total) were between 22 and 26 years old. Tracking data was stored using two different file formats: TXT files (raw data, acquired directly through the custom client software) and CSV files (generated from raw data, to enable faster data processing). Raw data files encompass the following fields: 1) anonymized identifier; 2) age; 3) gender; 4) video filename; 5) video timestamp; 6) position within the Hot&Cold matrix; 7) raw gaze tracking data from the RGB-D camera, stored in the form of Three Degrees of Freedom (3DoF) [19] orientation data: pitch, yaw, and roll; 8) three-dimensional positioning data (x, y, and z axis values), calculated from Euler angles [20] provided by the RGB-D camera for each gaze fixation; 9) three-dimensional point clouds related to gaze fixations from each participant, as acquired by the RGB-D camera. Additional material is also included with the dataset: an overview of the participant data and video playlist distribution, and complementary software tools developed for data extraction and plot generation. The dataset is 6.73 GB in total size and was publicly distributed through Mendeley Data [21,22].
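As an illustration of how the post-processed logs might be consumed, the following Python sketch loads one CSV log with pandas. The column names and the file path are assumptions derived from the field list above, not the actual headers shipped with the dataset; point-cloud data (field 9) is omitted for simplicity.

```python
import pandas as pd

# Hypothetical column names mirroring the raw-data fields listed above;
# the headers in the actual CSV files may differ.
COLUMNS = [
    "participant_id", "age", "gender", "video_filename", "video_timestamp",
    "matrix_position", "pitch", "yaw", "roll", "x", "y", "z",
]

def load_gaze_log(path: str) -> pd.DataFrame:
    """Load one post-processed gaze log and restore playback order."""
    df = pd.read_csv(path, header=0, names=COLUMNS)
    # Fixations were sampled at 0.05 s intervals; sort by video timestamp.
    return df.sort_values("video_timestamp").reset_index(drop=True)

# Hypothetical file name, following the playlist/view/participant split.
log = load_gaze_log("Log Files/playlist1_view1_participant01.csv")
print(log[["pitch", "yaw", "roll"]].describe())
```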

Data organization
To facilitate access for potential users, data was distributed across six separate directories, as depicted in the directory tree presented in Fig. 1. Contents from each of these directories will now be discussed in detail to provide a clear picture of what type of data can be expected. The Video Stimuli directory is composed of a compressed file (Video Stimuli.7z) which aggregates image frames extracted from each of the selected 360-degree videos. Views from these videos were spread across individual subdirectories (119 in total), each with a unique identifier (e.g., Video 1). Additionally, for each image frame made available, a matching saliency map (a graphical representation of user visual behavior data, as represented in Fig. 5) was also generated from cumulative tracking data. These maps can be accessed in the corresponding directory through the available compressed file (Saliency Maps.7z). Participant data is available in the homonymous directory, encompassing an Excel spreadsheet (Playlists and Participant Distribution.xlsx) where users can consult relevant information on participants (e.g., age, gender), along with participant distribution across available video playlists and views. Tracking data collected from participants during the experimental procedures can be consulted in the Log Files directory. Two data formats are made available with this dataset: raw data (stored as TXT) and post-processed data (stored as CSV), with individual data files provided for each playlist, view, and participant combination. Spatial and temporal perceptual information, calculated using siti-tools [23] for the six 360-degree videos, is provided under the respective directory to further characterize the complexity of the selected video content [24]. Complementary software tools, developed to facilitate working with this dataset, are also provided for two specific tasks: data preparation (e.g., conversion of raw data to post-processed format) and data visualization (e.g., generation of plots from available data).
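Since each image frame under Video Stimuli has a matching saliency map under Saliency Maps, a small Python helper can pair the two trees. This sketch assumes the .7z archives have been extracted in place and that the two trees mirror each other's relative paths, which is an assumption rather than a documented guarantee.

```python
from pathlib import Path

# Directory names as described above; extracted from the .7z archives.
FRAMES_ROOT = Path("Video Stimuli")
MAPS_ROOT = Path("Saliency Maps")

def paired_frames_and_maps():
    """Yield (frame, saliency map) path pairs sharing the same relative path.

    Assumes the two trees mirror each other (e.g., 'Video 1/frame_0001.jpg'
    exists under both roots); frames without a matching map are skipped.
    """
    for frame in sorted(FRAMES_ROOT.rglob("*.jpg")):
        candidate = MAPS_ROOT / frame.relative_to(FRAMES_ROOT)
        if candidate.exists():
            yield frame, candidate

for frame, smap in paired_frames_and_maps():
    print(frame, "->", smap)
```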

Relevance of data
The goal of the dataset is to provide gaze data collected while visualizing multi-view content. The majority of datasets currently available for use (a selection of the most relevant options is presented in Table 2) provide gaze data captured from HMDs while visualizing 360-degree content. Their applicability to multi-view solutions is considered limited due to the specific requirements raised by these types of scenarios (e.g., single-screen visualization, selection of different viewpoints for content presentation). Comparatively, the Data2MV dataset provides gaze data from a wide range of participants (45 users), captured by an RGB-D camera module while visualizing multi-view content. This content was spread across 7 video playlists, with each playlist being viewed a minimum of 17 times. Due to the scarcity of high-resolution multi-view datasets with sufficient content length to provide adequate viewing behavior data, this dataset provides valuable gaze data, which enables its use in interactive multi-view scenarios where elements such as viewing behavior and viewpoint selection are deemed essential.

Experimental Design, Materials and Methods
Experimental procedures conducted during the creation of the dataset required a specific set of conditions to be met, such as the selection of appropriate hardware and multi-view video content to enable meaningful data acquisition, and the development of a custom software solution for accurate data processing and analysis. The following section describes each of these components in detail.

Hardware setup
An experimental setup, composed of a laptop computer, a computer monitor, and an RGB-D camera module (as visible in Fig. 2), was defined for the purpose of acquiring gaze fixation data. The Intel RealSense F200 camera was physically installed below the computer monitor (using a monitor stand) and connected to an HP EliteBook 940 G2 laptop computer via a USB 3.0 connection. The RGB-D camera module is composed of five core elements: a color sensor, an infrared (IR) sensor, an IR laser projector, an image processor, and a stereo microphone [25]. For the development of this dataset, only the color sensor, IR sensor, and IR laser projector were used, capturing data in Full HD resolution (1920 × 1080) at 30 frames per second. Depth data was acquired from users using a three-step process [5]: 1) the IR laser projector emits a structured light pattern; 2) the IR sensor detects the light patterns reflected off objects or individuals; 3) 3D surfaces are reconstructed from the reflected patterns and stored as point clouds. According to Intel's specifications, users must be situated within the recommended range of 0.2-1.2 meters in order to achieve optimum depth accuracy (prior literature [5] confirms that the Intel RealSense F200 achieves an average depth accuracy between 1.46 and 1.53 mm RMS under similar conditions). A 24" Full HD (1920 × 1080) computer monitor was installed for the presentation of multi-view content to the selected observers, at a distance between 0.8 and 1.0 meters and with a field of view (FoV) of approximately 56°.
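The reconstruction in step 3 follows standard structured-light geometry. Purely as an illustration, the following Python sketch back-projects a depth image to a point cloud under an assumed pinhole model; the intrinsics are hypothetical, and in practice this reconstruction is performed on-device by the F200's image processor.

```python
import numpy as np

def depth_to_point_cloud(depth_m: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Back-project a depth image (in metres) to an N x 3 point cloud.

    Standard pinhole back-projection; fx, fy, cx, cy are camera intrinsics.
    Shown only to illustrate the geometry; the F200 does this on-device.
    """
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no depth reading

# Illustrative intrinsics only; real values come from device calibration.
cloud = depth_to_point_cloud(np.random.uniform(0.2, 1.2, (480, 640)),
                             fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(cloud.shape)
```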

Client software
To collect gaze data in real-time, a custom client software solution was developed in C#, combining Intel RealSense SDK tracking features [26] with Windows Media Player video playback capabilities. An overview of the workflow used for the collection of gaze tracking data using the client software can be visualized in Fig. 6. The workflow is initiated with the obligatory calibration procedure, conducted prior to any viewing session: users were asked to follow a red dot presented across the screen in order to confirm that tracking data accurately represented their facial movements. After this initial procedure was concluded successfully, multi-view content was presented to users according to the video composition included within each playlist. Selected multi-view content was downloaded from the dataset storage and presented to users in full screen (using Windows Media Player components for this purpose). During the presentation of multi-view content, gaze tracking data (e.g., three-dimensional point clouds, Euler angles) was simultaneously collected in real-time, synchronized with multi-view content data (e.g., timestamps), and continuously stored in the form of raw data in TXT files. Additionally, the client software also converted gaze fixation data from a three-dimensional space into a two-dimensional coordinate space: x, y, and z axis values were computed from the pitch, yaw, and roll values previously acquired and stored as Euler angles [20] by the RGB-D camera module. These computed values were projected onto the Hot&Cold matrix structure [4] (considering camera parameters such as center calibration, variable FoV, and distortion [27]), delivering a graphical representation of users' viewing behavior in real-time during content playback (Fig. 3). After the process of collecting gaze data was concluded, an additional set of post-processing tasks was conducted, as depicted in the workflow presented in Fig. 7: 1) for each multi-view video used for gaze data acquisition, a predefined set of image frames was extracted using FFmpeg tools based on their overall relevance (e.g., keyframes); 2) for each of the extracted image frames, corresponding saliency maps were generated using the cumulative gaze tracking data previously collected from the participants; 3) image frames and related saliency maps were stored in the corresponding playlist/video directory, available at the dataset storage location. A simplified sketch of the projection and saliency-map generation steps is given below.
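The following Python sketch illustrates these two steps under simplifying assumptions: the planar projection ignores the center calibration and lens distortion corrections applied by the actual client software, and the fixation-to-saliency step uses a plain Gaussian blur. The 1920 × 1080 resolution and the approximate 56° horizontal FoV come from the hardware setup above; the blur width and all function names are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

W, H = 1920, 1080          # screen resolution used in the experiments
H_FOV = np.radians(56.0)   # approximate horizontal FoV (see hardware setup)
V_FOV = 2 * np.arctan(np.tan(H_FOV / 2) * H / W)  # assumes square pixels

def gaze_to_screen(pitch: float, yaw: float) -> tuple[int, int]:
    """Map a gaze orientation (radians) to pixel coordinates.

    Simplified planar projection: the real client software additionally
    corrects for center calibration and lens distortion, omitted here.
    """
    x = (0.5 + np.tan(yaw) / (2 * np.tan(H_FOV / 2))) * W
    y = (0.5 - np.tan(pitch) / (2 * np.tan(V_FOV / 2))) * H
    return int(np.clip(x, 0, W - 1)), int(np.clip(y, 0, H - 1))

def saliency_map(fixations: list[tuple[float, float]],
                 sigma: float = 30.0) -> np.ndarray:
    """Accumulate (pitch, yaw) fixations and blur them into a saliency map."""
    heat = np.zeros((H, W))
    for pitch, yaw in fixations:
        col, row = gaze_to_screen(pitch, yaw)
        heat[row, col] += 1.0
    heat = gaussian_filter(heat, sigma=sigma)  # spread fixations spatially
    return heat / heat.max() if heat.max() > 0 else heat
```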

Test stimulus
To acquire gaze fixation data from participants, six 360-degree videos (available under Creative Commons licenses) were selected for playback purposes (video specifications can be consulted in Table 1). These were originally presented in equirectangular format and encoded in 4K resolution (3840 × 2160 pixels), with frame rates ranging between 24 and 30 frames per second. To present each perspective according to cubemap projection, videos were split into six individual views (each with a resolution of 1280 × 1080 pixels). Of the 36 views created, 7 were discarded due to a lack of visual interest in their content. Smaller portions of the remaining views (13 in total) were also discarded to allow for a better distribution of views among the selected playlists. To characterize the content of the selected videos, four categories were considered: Outdoor (e.g., natural landscapes), Urban (e.g., urban objects and architecture), Rural (e.g., non-urban environments), and People (e.g., human presence). Additionally, Temporal Perceptual Information (TI) and Spatial Perceptual Information (SI) data were also computed for each of the videos by resorting to Sobel filters (3 × 3 pixels) [28].
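SI and TI were computed for the dataset with siti-tools [23]; the sketch below is a minimal Python reimplementation of the underlying ITU-T P.910 definitions, shown only to make clear what these values measure. Frames are assumed to be 2-D float luminance arrays.

```python
import numpy as np
from scipy.ndimage import sobel

def si_ti(frames: list[np.ndarray]) -> tuple[float, float]:
    """Compute Spatial (SI) and Temporal (TI) Perceptual Information.

    Per ITU-T P.910: SI is the maximum over frames of the standard
    deviation of the Sobel-filtered luminance; TI is the maximum over
    frames of the standard deviation of successive frame differences.
    """
    si_values, ti_values = [], []
    prev = None
    for frame in frames:
        # Gradient magnitude from horizontal and vertical Sobel responses.
        grad = np.hypot(sobel(frame, axis=0), sobel(frame, axis=1))
        si_values.append(grad.std())
        if prev is not None:
            ti_values.append((frame - prev).std())
        prev = frame
    return max(si_values), (max(ti_values) if ti_values else 0.0)
```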

Ethics Statements
Participation in the experimental activity was open to all, with participants volunteering at their own discretion. To guarantee that General Data Protection Regulation (GDPR) guidelines [29] were strictly followed, the following procedures were conducted: 1) permission was requested from participants to collect, store, analyze, and publish the collected data; 2) data gathered from participants during the experimental procedure was kept to a minimum; 3) data was subjected to anonymization techniques (e.g., removal of personal identification data).
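The article does not specify the anonymization technique used. Purely as an illustration, one common approach is to replace personal identifiers with salted one-way hashes; the function name and salt below are hypothetical.

```python
import hashlib

def anonymize(identifier: str, salt: str) -> str:
    """Replace a personal identifier with a salted one-way hash.

    Illustrative only: the dataset removed personal identification data,
    but the exact anonymization technique is not documented.
    """
    digest = hashlib.sha256((salt + identifier).encode("utf-8"))
    return digest.hexdigest()[:12]  # short anonymized identifier

print(anonymize("Jane Doe", salt="example-salt"))
```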

Fig. 1. A directory tree depicting the structure of the dataset and its content.

Fig. 2. The RGB-D camera setup used for the collection of gaze data from users.

Fig. 4. A representative set of video frames from available views on each of the selected playlists.

Fig. 5. A representative set of saliency maps from the corresponding video frames.

Fig. 6. Sequence diagram detailing the process of collecting gaze data from participants.

Table 1
Main properties of the omnidirectional video dataset.

Table 2
Selection of the most relevant datasets for immersive scenarios and applications.
