Synthetic Distracted Driving (SynDD2) dataset for analyzing distracted behaviors and various gaze zones of a driver

This article presents the Synthetic Distracted Driving (SynDD2) dataset, a continuation of SynDD1, for machine learning models that detect and analyze drivers' distracted behaviors and gaze zones. We collected the data in a stationary vehicle using three in-vehicle cameras positioned on the dashboard, near the rearview mirror, and at the top corner of the right-side window. The dataset contains two activity types for each participant: distracted activities and gaze zones. Each activity type has two sets: without appearance blocks and with appearance blocks, such as wearing a hat or sunglasses. The order and duration of each activity are randomized for each participant. In addition, the dataset contains manual annotations marking the start and end time of each activity. Researchers can use this dataset to evaluate the performance of machine learning algorithms that classify drivers' distracting activities and gaze zones.


Subject
Data Science

Specific subject area
Driver behavior analysis, Driver safety

Type of data
Infrared videos, annotation files

How the data were acquired
Three in-vehicle cameras acquired the data. We requested the participants to sit in the driver's seat and then instructed them to perform driver-distracting activities or gaze at a specified region for a short time interval. The instructions were played on a portable audio player.

Data format
Video files are in .MP4 format, and annotation files are .csv files.

Description of data collection
We designed a survey using a Qualtrics form and selected the respondents based on criteria that ensured a balanced representation by gender, age, and ethnicity.

Value of the data
• The data will serve as baseline data for training, testing, and validating computer vision-based machine learning models whose primary objective is detecting and classifying driver behaviors and gaze zones.
• The data can be used to benchmark the performance of various machine learning models designed with a similar objective.
• The data can be used by researchers analyzing driver behaviors whose objective is detecting and classifying driver activities.
• The data can help researchers design and build driver-assist systems that improve drivers' safety by alerting them during driving.

Data description
The dataset consists of video files and annotation files. The videos were collected using dashcams with the specifications shown in Table 1, and the data acquisition requirements are shown in Table 2. The annotation files (.csv) contain the information shown in Table 3 and cover all the camera views shown in Figure 1. The synthetic data collection process involved three in-vehicle cameras [8] positioned near the dashboard, near the rearview mirror, and at the top corner of the right-side window, as shown in Figure 1. We requested the participants to sit in the driver's seat of a stationary vehicle. Then we gave them instructions (played using a tablet) to gaze at a particular region or to perform a distracting activity continuously for a short time. The duration and order of activities were random for each participant. We call the dataset thus generated Synthetic Distracted Driving (SynDD2).

Experimental design, materials, and methods
The specifications of the videos in the dataset are shown in Table 6.

Gaze zone
Figure 2 shows the eleven gaze zones in the car, and Table 7 lists all the gaze zones.

Distracted behavior
Each participant continuously performed sixteen distracted driving behaviors, as shown in Figure 3, each for a short time interval. The sixteen activities are listed in Table 8.

Method
We requested the participants to follow the instructions, which were played on a portable audio player, and after completing one set of activities, we requested them to repeat the set while wearing a hat or sunglasses. One set of gaze activities took approximately 5-6 minutes, while one set of distracted driving activities took around 10 minutes. The whole set of activities took about one hour to finish. The sequence and duration of each activity were randomized for each participant to introduce complexity into the data for analysis.
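As an illustration only, the following minimal Python sketch shows how such a randomized activity schedule could be generated; the activity names and the duration range are hypothetical placeholders, not the exact values used during collection.

import random

# Illustrative subset of activities (the full lists are in Tables 7 and 8).
activities = ["drinking", "eating", "adjusting the radio", "texting"]

# Shuffle the order and assign each activity a random duration in seconds.
# The 5-15 second range is an assumed placeholder, not the study's actual range.
random.shuffle(activities)
schedule = [(activity, random.randint(5, 15)) for activity in activities]

for activity, duration in schedule:
    print(f"{activity}: {duration} s")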

Instructions for activities
We created an instruction video in English for both activity types for each participant. We used gTTS [9] (Google Text-to-Speech) to convert the written instructions into speech. For the gaze activity type, the video showed the region to gaze at; for the distracted activity type, it displayed activity names in English, such as drinking and eating. The instructions started by explaining the kind of activity the participant would perform.
Then the instruction video played a short beep sound, signaling the participant to begin the activity. For the gaze activity type, participants continued to gaze until they heard a long beep sound; for the distracted activity type, we requested them to act naturally and stop whenever they wanted or when they heard a long beep.
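For reference, a minimal sketch of generating one such spoken instruction with gTTS follows; the instruction text and output file name are illustrative assumptions, not values from the actual study.

from gtts import gTTS

# Convert one instruction line to spoken English audio (assumed example text).
instruction = "Please look at the rearview mirror."
tts = gTTS(text=instruction, lang="en")
tts.save("instruction_rearview.mp3")  # hypothetical output file name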
We added the beep sounds to synchronize the videos from the different camera views and to help annotate the activities manually.

Data pre-processing
By default, each camera split the video files after a fixed time interval. As a result, each participant's raw data contained multiple video files from a single camera. Hence, we combined each participant's video files into a single file using Python and FFmpeg [10].
"ffmpeg -f concat -safe 0 -i video-input-list.txt-c copy 'output'" where: video-input-list.txt-videofiles from a camera view in the increasing order of time output-the name of the output file We sorted the video files and added the file names in the video-input-list.txtfile.Then using FFmpeg, we concatenated the videos listed in the text file giving us a single file output.
After that, we divided the output video into multiple video files based on the activity types: gaze, gaze with appearance block, distracted, and distracted with appearance block for each camera view.
"ffmpeg -ss {start} -t {dur} -i {p} -c copy {out}" where: {start} -represents the start time of the activity type (gaze/distracted), {dur} -represents the length of that activity type, {p} -represents the path of the input file, and {out} -represents the output file name.
Finally, we synchronized the videos from the three camera views based on the beep sound played in the instructions.
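One way to perform this alignment, sketched below under the assumption that the beep time in each view has already been identified manually (the timestamps and file names shown are hypothetical), is to trim each view so that all three start at the beep:

import subprocess

# Hypothetical beep timestamps (in seconds) identified for each camera view.
beep_times = {"dashboard": 4.2, "rearview": 3.8, "right_window": 5.1}

for view, beep_time in beep_times.items():
    # Drop everything before the beep so all three views start together.
    subprocess.run(
        ["ffmpeg", "-ss", str(beep_time), "-i", f"{view}_full.MP4",
         "-c", "copy", f"{view}_synced.MP4"],
        check=True,
    )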

Data annotation
We manually annotated each video from each camera for each participant. The annotation file includes each activity's start and end times; more information is in Table 3.
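As a usage sketch, the annotation files can be read with pandas; the file name and column names below are assumptions for illustration, since the actual schema is the one listed in Table 3.

import pandas as pd

# Load one participant's annotation file (hypothetical file and column names).
annotations = pd.read_csv("participant_01_annotations.csv")
for _, row in annotations.iterrows():
    print(row["activity"], row["start_time"], row["end_time"])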

Figure 1. Camera positions inside the car

Table 1. Specifications of the video acquisition system

Table 2. Data acquisition requirements

Table 3. Variables in the dataset

For each participant, there are twelve video files, since each camera has two activity types (gaze/distracted) and each activity type has two sets (with/without appearance block), as shown in Table 4. The video files are infrared and do not contain any audio data.

Table 4. Different videos for one camera view

Table 6. Specifications of videos

Table 7. Gaze zones

Table 8. Distracted behaviors