Spectral Representation Learning and Fusion for Autonomous Vehicles Trip Description Exploiting Recurrent Transformer

A thorough analysis and comprehension of the entire cue set in visual data are indispensable for an ideal video description model. As outlined in recent algorithm proposals, video descriptions have primarily been generated by learning RGB and optical flow representations rather than exploring and incorporating the media's spectral components, i.e., the patterns or characteristics in the distribution of colors or intensities across different frequencies or wavelengths of light. These components may enhance description quality and influence the generated text in terms of accuracy, diversity, and coherence. To fill this research gap, we propose a novel Fourier-based algorithm that extracts spectral features from the 3D visual volume by decomposing the video signal into its frequency components. The captured spectral features are then fused with learned spatial and temporal representations in a recurrent transformer architecture for accurate content understanding and appropriate description generation in natural language. The transformer includes an external memory module that produces summarized memory states based on the history of previously observed video fragments and already-generated sentences. These memory states ensure the establishment of sound semantic and linguistic cues. As a result, our proposed algorithm integrates spatial, temporal, spectral, and semantic representations for precise and grammatically accurate descriptions. The effectiveness of our proposed algorithm for coherent and diverse video description is demonstrated through qualitative and quantitative experimentation on the DeepRide driving trip description dataset. A comprehensive ablation study validates the efficacy of fusing spectral features with spatial and temporal visual representations for rich video-to-text narration generation.


I. INTRODUCTION
A straightforward, superficial conversion of visual information from a video to text is insufficient to interpret its content comprehensively. Even though object and action detection can contribute to caption generation in some cases, a comprehensive video description model must be able to comprehend all visual cues present in a video. An encoder and a decoder constitute the core architecture of video description [1], [2], [3], [4], [5]. A context vector is generated by processing visual information using convolutional or recurrent neural networks. Given the context vector, the decoder aligns its elements with the corresponding text in natural language, allowing the caption to be generated word by word while considering the context vector and previously generated words.
In recent video description algorithms [6], [7], [8], [9], [10], [11], [12], spatial features (RGB) and temporal features (optical flow) are extracted and utilized to understand visual information across multiple frames. Although these features are helpful to some extent, they may not be sufficient to adequately describe videos that contain diverse data, color ranges, intensities, and representations beyond spatial and temporal aspects. It is important to thoroughly explore all features of a video in order to create an accurate natural-language textual description. Having more extracted features (related information) increases the ability of the developed algorithm to describe the visual data in natural language. It is important to note that the observable visual information alone is insufficient to provide a comprehensive understanding of the content in a video description system, so all cues and information contained within the visual data must be carefully and thoroughly investigated. The collection of pixel-wise information of a video frame constitutes its spatial or RGB set of information, primarily representing objects or states in a scene. Optical flow or temporal data, by capturing the difference in pixel positions between successive frames, enables a better representation of actions or verbs in the video.
The semantic information contained within media data provides linguistic cues to the language model, allowing the model to develop a prior understanding of video content through a contextual perception informed by semantics. This understanding makes it easier to generate accurate, diverse, and grammatically correct descriptions in natural language. Semantic representations have the potential to enhance video description generation at various levels of granularity [13]. Additionally, using an external memory block [9] facilitates the creation of concise memory states by taking into account the history of previous video segments and the sentences that have already been generated. This approach ensures that solid semantic and linguistic cues are incorporated in order to generate effective video descriptions.

A. MOTIVATION
Integrating additional sources of information through the learning and fusion of spectral representations can significantly enhance the accuracy and diversity of generated video descriptions. The fine details of the video content can be captured to obtain a more accurate and reliable textual description of the video. Additionally, spectral features tend to generalize to a broader range of video types and are more capable of handling variations in lighting, camera angles, and motion that can vary significantly from one video to the next. Further, extracting relevant features from the spectral domain can reduce the computational requirements for processing video data, resulting in improved efficiency and scalability. Exploiting spectral features in video-to-text models allows the employment of information inherent within the video data, enabling applications in a variety of real-world scenarios and challenges.
Involving semantic information of any granularity boosts description quality because prior knowledge, i.e., video category, action information, and topic, is fed to the decoder regarding video content [13]. The same applies to the inclusion of audio, sound, and text features. Adding in-depth knowledge about the visual content facilitates diversity and beyond-visual explanation in natural language. Likewise, recent articles [14], [15] motivated us to explore spectral representations as a third mode of features contained within the visual data of a video. The research work [14] advanced the research presented by FNO (Fourier Neural Operator) [16] and proposed a spatio-spectral neural operator combining spectral and spatial feature learning, primarily dealing with CFD (Computational Fluid Dynamics), achieving the lowest relative error. It suggested a descending mode selection approach to provide different spectral features at each subsequent layer. DSFA [15] proposed a novel deep spectral-aggregation approach with block-wide feature aggregation employing the Fourier neural operator. It incorporated spectral channel compression to extract the most learned information and prevent information loss during layer cascading. FNO was previously used for learning mappings between function spaces for parametric partial differential equations and had never been utilized in the video domain for representation learning. To the best of our knowledge, this is the first time FNO has been employed in the video domain for spectral representation learning.
The learned spectral representations can convey additional meanings alongside the video's spatial and temporal connotations. The concept of spectral feature extraction encouraged us to explore this research direction toward the visio-linguistic task of video description. The union of the three feature sets, i.e., spatial, temporal, and spectral, best supports competitive description generation in natural language. An illustration (for visualization and concept understanding only) of how these features overlap, and how their union can create a better algorithm, is depicted in the Venn diagram in Figure 2.

B. SPECTRAL FEATURES
Spectral features refer to the frequency content of a signal, which can be analyzed by employing Fourier transforms. The use of spectral features for tasks such as audio and speech recognition has proven valuable in capturing signal pitch and frequency information. Particular to visual data, a video contains static bitmap information in frames as well as nature-based physics patterns that follow linear and differential equations. For basic RGB and flow feature extraction [6], [8], [17], [18], we can employ convolutional neural networks. However, simple neural networks cannot learn these increasingly complex representations because they are not designed to learn nature-based, physics-informed patterns that follow specific linear and differential equations. Videos involving human action or vehicle movement have a high probability of containing such physical motions and patterns.
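For intuition only, the following minimal PyTorch sketch (illustrative, not the proposed F3D pipeline) computes the log-magnitude spectrum of a single frame with a 2D Fourier transform, the basic operation underlying all the spectral analysis discussed below:

```python
import torch

frame = torch.rand(224, 224)             # hypothetical grayscale frame in [0, 1]
spectrum = torch.fft.fft2(frame)         # complex frequency-domain representation
spectrum = torch.fft.fftshift(spectrum)  # center the zero-frequency (offset) component
magnitude = torch.log1p(spectrum.abs())  # log-magnitude spectrum of the frame
```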
Specific to dashcam videos, they record the scenes in front of the vehicle and capture a wide range of colors, intensities across different frequencies of light, and texture of objects and scenes in the video. Therefore, an extensive investigation is crucial to examine the effect of spectral representation learning and physics-based information from within the visual information on the overall performance of textual narration generation.
The availability of spectral features in visual data strongly depends on the contents of the video and the analysis methodology. Videos with scenes containing bright or dark areas that exhibit variations in light intensity across different frequencies are characterized as having spectral features. Likewise, dashcam driving videos recorded during daytime and nighttime capture different light intensities and brightness, hence containing spectral information. Generally, spectral analysis of videos is a common technique in the fields of computer vision, image processing, and video analysis. Identifying these features allows researchers to extract useful information from videos for tasks such as object recognition, motion detection, and scene segmentation.
Deriving spectral features from driving videos recorded with dashcams may prove helpful in several applications, such as detecting road signs and traffic lights, scene recognition, road surface analysis, lane marking, and obstacle detection. Analyzing the spectral properties of objects, the environment, and the road surface in a dashcam driving video allows the detection and recognition of vehicles, pedestrians, traffic signs, urban and rural areas, highways, lane markings, potholes, and speed bumps. This results in enhanced description generation for objects in the scene, contextual information about the video, and insights into road conditions. Further, with the inclusion of information about the color and texture of objects and scenes, textual descriptions can provide a complete understanding of video content, which can be beneficial in a variety of applications, such as driver assistance, safety analysis, and road maintenance.

C. SPATIAL FEATURES
The deep architecture of a CNN learns a hierarchy of features, i.e., pixels, edges, shapes, patterns, regions, and objects, from the set of training images. CNNs are designed to automatically learn appearance hierarchies of representation via backpropagation, utilizing standard deep-learning building blocks. Recent video description research primarily focuses on extracting appearance information to identify objects, patterns, and shapes for robust text generation. Extraction of spatial or appearance features for video/image captioning [6], [7], [8], [10], [11], [12], [19] is a fundamental paradigm. Specific video description algorithms employ state-of-the-art 2D and 3D CNNs pre-trained on large-scale datasets to extract spatial/appearance features.
Considering dashcam driving videos, spatial features can support generating text descriptions by analyzing the spatial layout and distribution of objects, events, the environment's structure, and accidents or near-miss situations to convey the content and context of the video accurately.

D. TEMPORAL FEATURES
As a key feature of video analysis and text description generation, temporal features capture information regarding the motion and dynamics of the objects and events observed in the video. Optical-flow features are intended to provide information about the displacement of objects within consecutive video frames by employing per-pixel motion estimation. The history of optical flow prior to the deep learning era includes data matching errors, exploration of spatial statistics using sequences generated from depth maps, and stochastic optimization [20]. Recently, most top-performing algorithms approximate optical flow with CNNs, e.g., PWC-Net [21], I3D [22], and C3D [23]. Particular to the video captioning task, [17], [18], [24], [25], [26], and [27] proposed temporal representation learning, e.g., local and global temporal features [17], for enhanced captioning performance.
Regarding dashcam driving videos, temporal features may provide valuable information for a variety of driving-related tasks, including collision detection, lane change detection, and event recognition. It is possible to detect and track the motion of objects such as vehicles, pedestrians, and cyclists through temporal analysis of the time-based changes in the position of objects in the video. Similarly, road event detection and classification can be carried out through temporal analysis at sudden stops, swerves, or collisions. Further, the temporal representation learning and fusion technique also allows for detecting and classifying driver behavior, such as speeding, sudden lane changes, and aggressive driving. Overall, temporal features learning provides essential information about the dynamics of objects and events in the video and can facilitate accurately describing the movement of objects in the scene, the events and their context, and insights into driver behavior to improve driver safety.
Furthermore, to thoroughly understand the spatial, temporal, and nature-based flow of information, a shared recurrent transformer architecture [28], [29], [30] is established for automatic textual narration generation. A memory-augmented architecture inspired by [9] is adopted for modeling the history of previous video segments and generated sentences. The memory module's cell update functionality conceptually mimics the LSTM and GRU, capable of modeling complex relations. It also ensures that semantic cues are conveyed to the recurrent transformer's decoder for better interpretation of the linguistic cues.
To demonstrate the efficacy of the proposed algorithm, we have conducted extensive experiments on the large-scale DeepRide dataset [31]. DeepRide features 16k dashcam videos corresponding to around 130k sentences in 16k paragraph descriptions. The dashcam driving videos are recorded in diverse weather conditions and varied scene locations. Sample frames of a DeepRide dashcam video, with the multiple-sentence ground-truth trip description from the training split, are depicted in Figure 1. We believe the superior performance of the proposed algorithm demonstrates the validity of the concept of uniting all learnable representations to boost video description generation. Moreover, a thorough ablation analysis is performed over all possible compositions of the learned feature sets to establish the effectiveness of the proposed algorithm for automatic text description generation in natural language.
Our contributions toward this research work are as follows:
1) We design a novel 3D Fourier-based neural network (F3D) to learn nature-aware representations from the 3D video volume and extract spectral features by transforming the spatial domain into the spectral domain.
2) We demonstrate the significance of spectral features by developing a video description framework that incorporates the visual features (spatial, temporal, and spectral) and text features as input and learns the mapping effectively.
3) We collect results that prove the significance of spectral features in videos. Our proposed framework achieved state-of-the-art results on the DeepRide dataset with competitive performance.
4) We perform an ablation study considering all possible compositions of the representation learning mechanisms and establish the worth of our proposed algorithm for accurate and diverse description generation.

1) PROBLEM STATEMENT
We are provided with an untrimmed video X = {x_1, x_2, . . . , x_|X|} for video description generation, where x_i represents a frame and |X| is the total number of frames in the video X. The video contains n events E = {e_1, e_2, . . . , e_n}, where e_i is an event with start timestamp (e_i)_startTime and end timestamp (e_i)_endTime. The goal of our proposed video description model is to generate a paragraph P = {S_1, S_2, . . . , S_n} comprising natural language sentences describing the given video X, where each sentence S_i describes one event in the video and consists of m words such that S_i = {w_1, w_2, . . . , w_m}.
For each video X we extract three types of features, i.e., spatial S, temporal T, and spectral F, and then concatenate them for better visual understanding and exact textual interpretation such that γ(features) = {F ⊕ S ⊕ T}, where ⊕ represents feature concatenation or fusion.
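As a minimal illustration (the 1024-d clip-level shapes are taken from the extractors described in Section III; the random tensors are stand-ins), the fusion reduces to a tensor concatenation:

```python
import torch

spectral = torch.randn(1, 1024)  # F: output of the Fourier-based model (Section III)
spatial  = torch.randn(1, 1024)  # S: output of the appearance CNN
temporal = torch.randn(1, 1024)  # T: output of the optical-flow network

gamma = torch.cat([spectral, spatial, temporal], dim=-1)  # F ⊕ S ⊕ T -> shape (1, 3072)
```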

2) PAPER ORGANIZATION
The rest of the paper is organized as follows: Section II provides a detailed outline of the related literature on the topic; Section III explores the methodology, including an overview and detailed analysis of the proposed algorithm for spectral feature extraction and its fusion with spatial and temporal representations within the recurrent transformer for video description generation. Section IV presents the experimental results to validate the model's qualitative and quantitative performance, along with a detailed discussion of the achieved results, followed by an ablation study; finally, the paper is concluded in Section V.

II. RELATED WORKS
This section briefly overviews existing research that directly impacts our proposal and research methodology.
Recent research demonstrates that transformers have also been employed as visual or linguistic components of the ED (encoder-decoder) structure for video description [61]. The research work [2] employed reinforcement learning and an encoder composed of transformer encoder blocks to extract the features of a video over a global view, which reduced the amount of information lost in the intermediate hidden layers.
Aiming to describe the video comprehensively and accurately, this research systematically analyzes hidden aspects of media, such as information derived from physical phenomena, nature, spectral data, and spatial and temporal facets.

B. MEMORY MODELLING
An architectural concept named memory modeling refers to a computational framework that predicts the behavior of a system using an attached memory block. In memory-enabled models, a recurrent implicit update can be performed using LSTMs [62] or GRUs [63], and explicit memory accessibility is accomplished via attention-based algorithms.
Memory-augmented recurrent neural networks are believed to be ideal for understanding long videos and providing dense captions [64]. These systems primarily target contextual understanding by enabling the storage and retrieval of memory cells, which remained problematic for traditional RNNs.
By extending the memory ability of traditional neural networks, several memory-augmented networks have recently been employed for tasks requiring dynamic reasoning, such as textual question answering, visual question answering, video summarization [65], image captioning, and video description. The proposed multi-modal memory [66] described videos in text by building a visual-textual shared memory and enhancing long-term visual-textual dependency modeling with multiple read and write operations. The limited memory capacity of RNN-based architectures and the disregard for the rich interaction between the regions of the image led [67] to design a latent memory-guided graph transformer (LMGT) that simultaneously captures the visual relationships within each image and the sequential relationships across the image stream, to better maintain topic consistency and preserve inter-sentence coherence.
While transformer architectures can model historical information well, they have limited capabilities in modeling current information appropriately [29]. Through a segment-level recurrence mechanism, Transformer-XL [29] enables the acquisition of learning dependency beyond a fixed length without disrupting temporal coherence, whereas MART [9] enables the transmission of semantic and linguistic information more efficiently by utilizing highly summarized memory states to pass along helpful information.
This article aims to thoroughly investigate all aspects of the supplied video to describe grammatically correct and diverse natural language sentences to help people understand the detailed factual visual content of all video segments.

C. PHYSICS-INFORMED NEURAL NETWORK
It has been demonstrated that physics-informed neural networks (PINNs) can learn functional spaces directly and provide accurate approximations in a relatively short time compared to conventional mechanisms. PINN is a deep learning-based technique that bridges the gap between machine learning and scientific computing and seamlessly integrates data with mathematical models. Recent advances in deep learning and physics-informed neural networks [68], [69], [70] have generated a great deal of interest in scientific research and engineering fields.
A novel interface for machine learning [71] introduces mechanisms for integrating physical principles into deep neural networks and enables synergistic combinations. Furthermore, recent articles [14] and [15] have inspired us to consider spectral representations as a third mode of visual features contained within video data; they can convey added meanings in addition to the video's spatial and temporal connotations. In [15], the authors presented a novel deep Fourier neural network that employed a Fourier neural operator as a fundamental building block and utilized spectral feature aggregation to extend the information set. Spatial convolutions learn representations as spatial features, whereas spectral convolutions learn functional spaces directly. In addition to taking into account the physical laws that govern the behavior of the objects in the video, these models can generate descriptions that are more accurate and consistent in capturing the dynamics of the scene.

D. FEATURE EXTRACTION AND FUSION
A significant contribution to the accuracy and diversity of the generated description comes from the extraction and aggregation of representative features found within the video modalities. Caption-generating algorithms that ignore the multi-modal nature of videos have a limited capability to accurately describe and convey the visual content in textual narration. The multi-modality can include visual, audio, sound, and metadata channels of a video, whereas spatial, temporal, semantic, and spectral information corresponds to the visual representations and to 2D/3D scene/object/action recognition visual or intermediate features [72], [73].
Based on the proposed SART [13], semantic features captured in videos can be fused with spatial and temporal representations to improve the captioning accuracy of videos. Their scenario understanding module with fine-grained semantic information provided a more comprehensive understanding of visual content at the clip level than at the video level.
Finally, it is imperative for an ideal description generation algorithm to make an intelligent choice of visual representations, i.e., spatial, temporal, semantic, and spectral. Furthermore, an efficient and capable aggregation or concatenation strategy is indispensable. Exploiting multiple sources of information and attention mechanisms to weigh the contribution of each learned representation in vision-to-language models provides a comprehensive representation of the video content and can produce more accurate and informative text descriptions of videos.

III. METHODOLOGY
We propose a two-step methodology for video description generation. The first step deals with the novel representation learning of visual data in the spectral domain using the Fourier transform. The second step involves the fusion of the learned spectral features with spatial and temporal representations in a memory-enabled recurrent transformer for video description generation.

A. SPECTRAL REPRESENTATION LEARNING
The novel representation learning of visual data in the spectral domain using the Fourier transform involves converting pixel values from the spatial to the frequency domain for each selected frame. This yields a frequency spectrum indicating the contribution of each frequency component to the original frame content. The frequency spectra obtained from the Fourier transform can be analyzed to extract spectral features unique to the video content. These features may include periodic patterns, texture variations, or spatial arrangements characteristic of the video content.
A spectral block is adopted from our previous research works DSFA [15] (Deep Spectral Feature Aggregation physics-informed neural network) and SSNO [14] (spatio-spectral neural operator for functional space learning of partial differential equations). As outlined in Section I, SSNO combines spectral and spatial feature learning with a descending mode selection approach that provides different spectral features at each subsequent layer, while DSFA performs block-wide spectral feature aggregation over the Fourier neural operator and uses spectral channel compression to prevent information loss during layer cascading.
Regarding the novelty of the current research, FNO has previously been used for learning mappings between function spaces for parametric partial differential equations and had never been utilized in the video domain for representation learning; to the best of our knowledge, this is the first time FNO is employed in the video domain for spectral representation learning. The learned spectral features are further concatenated with the spatial, temporal, and text features (from the ground-truth annotations of the same video) of the same visual volume. A memory-augmented recurrent transformer, adopted from MART [9], exploits all these learned representations and generates a multi-sentence (paragraph-like) textual description of the provided video.
The Fourier transform that converts a signal f(x) from the time domain to the frequency domain is:

F(u) = ∫ f(x) e^(−i2πux) dx

where F(u) is the frequency-domain representation of the signal, u is the frequency, and i is the imaginary unit. The inverse Fourier transform of F(u), converting a signal back from the frequency domain to the time domain, is:

f(x) = ∫ F(u) e^(i2πux) du

where f(x) is the time-domain representation of the signal. Applying Fourier analysis to a video, the above equations can be extended to two dimensions to account for the spatial dimensions of the video frames. The Fourier transform of a video frame f(x, y) is given by:

F(u, v) = ∫∫ f(x, y) e^(−i2π(ux+vy)) dx dy

where F(u, v) is the frequency-domain representation of the video frame, u and v are the spatial frequency variables, and x and y are the spatial variables. The inverse Fourier transform of F(u, v) is given by:

f(x, y) = ∫∫ F(u, v) e^(i2π(ux+vy)) du dv

where f(x, y) is the spatial-domain representation of the video frame. We analyze each selected frame of a dashcam driving video by applying Fourier analysis in order to extract spectral features such as the frequency distribution of color and intensity variations over time. In summary, we introduce a spectral neural network implementing a spectral block with a skip-like spatial convolution path fused at the output of each layer. The spectral block accepts a spatial input of 224 × 224 frame size with a sampling rate of two frames per second, intentionally kept the same as the other two models employed for spatial and temporal representation learning. We employ a sliding video block of 64 frame-sets at each iteration and slide through the video to extract all the features over a volumetric data input. We obtain 1,024 features from the spectral model. The spectral block can be further divided into three sub-blocks.
A brief explanation of the sub-blocks in the spectral block follows. The first sub-block performs the Fourier transform of the signal from the spatial domain to the frequency domain:

F(k) = ∫ f(x) e^(−ikx) dx

where F(k) is the Fourier transform of the signal f(x) at frequency k, i is the imaginary unit, and the integral is taken over all values of x. The second sub-block performs complex multiplication of the real and imaginary parts:

(a + bi)(c + di) = (ac − bd) + (ad + bc)i

The third sub-block applies the inverse Fourier transform to convert the signal from the frequency domain back to the time domain:

f(x) = (1/2π) ∫ F(k) e^(ikx) dk

where f(x) is the signal in the time domain, F(k) is the Fourier transform in the frequency domain, the integral is taken over all values of k, and the factor (1/2π) ensures the normalization of the Fourier transform. Considering filtering in the frequency domain, the center of the spectrum represents a frequency of zero, also called the offset; the further away from the center, the higher the frequency component of the input. Keeping this in mind, we can derive high-pass and low-pass filters. A high-pass filter suppresses low frequencies and passes high-frequency components, whereas a low-pass filter suppresses high-frequency components and passes low-frequency ones. We employ low-pass filtering to capture the low-frequency components.
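These three sub-blocks correspond to the core of a Fourier neural operator layer. Below is a minimal PyTorch sketch of such a layer over a 3D video volume (our simplification, not the exact implementation: keeping only the lowest `modes` frequencies per axis realizes the low-pass step, and the class name, shapes, and initialization scale are our assumptions):

```python
import torch
import torch.nn as nn

class SpectralConv3d(nn.Module):
    """Sketch of the three sub-blocks: FFT, complex multiplication, inverse FFT."""
    def __init__(self, channels: int, modes: int):
        super().__init__()
        self.modes = modes  # low-frequency modes kept per axis (low-pass behavior)
        scale = 1.0 / (channels * channels)
        self.weight = nn.Parameter(
            scale * torch.randn(channels, channels, modes, modes, modes, dtype=torch.cfloat)
        )

    def forward(self, x):                                 # x: (batch, channels, T, H, W)
        x_ft = torch.fft.rfftn(x, dim=(-3, -2, -1))       # sub-block 1: to frequency domain
        out_ft = torch.zeros_like(x_ft)
        m = self.modes
        out_ft[:, :, :m, :m, :m] = torch.einsum(          # sub-block 2: complex multiply
            "bitxy,iotxy->botxy", x_ft[:, :, :m, :m, :m], self.weight
        )
        # sub-block 3: back to the spatial domain (normalization handled by irfftn)
        return torch.fft.irfftn(out_ft, s=x.shape[-3:], dim=(-3, -2, -1))
```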
Our proposed framework incorporates a particular feature extraction model that extracts spectral-domain features, which fundamentally capture the natural behaviors in the given video following the patterns of partial differential equations. The model learns these mathematical behaviors from the video frames and flow. The Fourier Net 3D convolution model employs spectral 3D layers responsible for learning the complex-domain parameters, as shown in Figure 4. We feed 64 video frames in each training example. The input is immediately passed through a fully connected linear layer to match the model's internal block size. The output of the linear layer is fed to a repeatable spectral block. In each spectral block, Fourier convolution and standard convolution are applied in parallel for diverse representation learning. The learned representations from both convolutions are concatenated to obtain the superset, or union, of all learned representations. After concatenation, the GeLU (Gaussian Error Linear Unit) activation function is applied, followed by linear and normalization layers. Four such spectral blocks are used in the Fourier Net 3D model. The F3D model outputs 1024 spectral features at its final layer. The Fourier spectral block is depicted in a bubble in Figure 4.
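A compact skeleton of this architecture could look as follows, building on the SpectralConv3d sketched above. The channel width, mode count, 1×1×1 input projection, and global-pooling head are our own assumptions for illustration, not the published F3D configuration:

```python
import torch
import torch.nn as nn

class SpectralBlock(nn.Module):
    """One repeatable block: Fourier and standard 3D convolutions in parallel,
    concatenation, then GeLU with linear and normalization layers."""
    def __init__(self, channels: int, modes: int):
        super().__init__()
        self.fourier = SpectralConv3d(channels, modes)         # from the sketch above
        self.conv = nn.Conv3d(channels, channels, kernel_size=1)
        self.linear = nn.Linear(2 * channels, channels)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                                      # (batch, channels, T, H, W)
        y = torch.cat([self.fourier(x), self.conv(x)], dim=1)  # union of both representations
        y = nn.functional.gelu(y).permute(0, 2, 3, 4, 1)       # channels-last for Linear/LayerNorm
        y = self.norm(self.linear(y))
        return y.permute(0, 4, 1, 2, 3)

class F3D(nn.Module):
    """Hypothetical F3D skeleton: input projection, four spectral blocks, 1024-d output."""
    def __init__(self, in_channels=3, channels=16, modes=8, out_features=1024):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, channels, kernel_size=1)  # match internal block size
        self.blocks = nn.ModuleList(SpectralBlock(channels, modes) for _ in range(4))
        self.head = nn.Linear(channels, out_features)

    def forward(self, video):                        # video: (batch, 3, 64, 224, 224)
        h = self.proj(video)
        for block in self.blocks:
            h = block(h)
        return self.head(h.mean(dim=(-3, -2, -1)))   # pooled -> 1024 spectral features
```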

1) DATASET FOR PRE-TRAINING FOURIER NET 3D (F3D)
We trained the Fourier Net 3D model on the UCF101 [74] video action recognition dataset, a popular dataset for action recognition in videos and a benchmark for evaluating action recognition algorithms. The dataset contains 13,320 videos of 101 action categories. Each video in the dataset is annotated with a single action category label. It is split into three parts: training, validation, and testing. The training part of the dataset comprises 9,537 videos, the validation part contains 3,783, and the testing part contains 3,784 videos. The UCF101 dataset is widely used as a source of pre-training for action recognition systems, allowing such systems to learn more generalizable features suitable for fine-tuning. Accordingly, in our research, we pre-train the model on UCF101 and then fine-tune it on our specific dashcam driving video description dataset, DeepRide.
After training the Fourier Net 3D model on UCF101, we obtained the weights at minimum loss. We applied early stopping at the minimum validation loss to avoid over-fitting; the best weights were saved and later loaded to extract video features.
During training, the model is configured for the 101 video action categories of the UCF101 dataset. As stated above, the standard train, validation, and test split recommended for UCF101 was used. However, as standard practice for feature extraction, we skip the final classification linear layer; removing it allows the extracted features to be consumed directly. Figure 6 shows the spectral feature extraction setup employing the proposed F3D model. The training history of Fourier Net 3D on the UCF101 dataset is plotted in Figure 5, depicting training and validation loss, the early stopping checkpoint, and the best observed loss. Table 1 lists the simulation parameters adopted during F3D model training.
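In code, skipping the classification head might look like the following hedged sketch (the Sequential composition, checkpoint filename, and the F3D skeleton from the previous sketch are all illustrative assumptions):

```python
import torch
import torch.nn as nn

backbone = F3D()                                        # 1024-d feature head, sketched above
model = nn.Sequential(backbone, nn.Linear(1024, 101))   # 101 UCF101 action categories

# ... pre-train on UCF101; save weights at minimum validation loss ...
# model.load_state_dict(torch.load("f3d_ucf101_best.pt"))  # hypothetical checkpoint name

model[-1] = nn.Identity()                # skip the final classification linear layer
model.eval()
with torch.no_grad():
    clip = torch.randn(1, 3, 64, 224, 224)   # one 64-frame sliding block
    features = model(clip)                   # -> (1, 1024) spectral features
```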

2) EVALUATION METRIC FOR FOURIER NET 3D (F3D)
The cross-entropy loss function is used as the evaluation metric for Fourier Net 3D (F3D) and is defined as

L = − Σ_c y_c log(p_c)

where L is the cross-entropy loss, the summation runs over all classes c, y_c is the true class label, and p_c is the predicted class probability produced by the model. Figure 5 depicts the training and validation loss, the early stopping checkpoint, and the best observed loss.

B. TRANSFORMER MODEL FOR VIDEO DESCRIPTION
We propose a novel Fourier-based video description framework employing three modes of video features, i.e., spectral, spatial, and temporal. These threefold multi-mode video representations extract features beyond conventional feature extraction approaches. We extract spatial features using a ResNet [75] pre-trained on Kinetics-600 [76] with an input size of 224 × 224 and 64 frames, employing a sampling rate of two frames per second, which yields 1024 spatial features. Further, we employ the PWC network [21] to extract optical flow features. The input of the optical flow network is set to 224 × 224 with a 64-frame block sliding at a rate of two frames per second, and we obtain 1024 optical flow features from the PWC network.
We employ ResNet for spatial features and PWC for optical flow from their standard implementations [38], and we introduce a spectral feature model using a Fourier transform layer, as explained in Section III-A, where we developed the spectral representation learning mechanism. We performed experiments to explore and analyze the effectiveness of the extracted spectral features. These learned representations can capture the natural behavior embedded in the video's actions and movements, learning spectral features from dashcam videos in various scenarios, e.g., different light intensities. The overall video description transformer architecture, with the fusion of spectral, spatial, optical flow, and text features, is depicted in Figure 3.
We employ a recurrent transformer model extended from the memory-augmented recurrent transformer (MART) [9] for coherent video description generation. The transformer is configured as a shared encoder-decoder with 12 attention heads and a hidden size of 768. A positional encoder provides the positional vector before the input is fed to a multi-head attention block. The model employs a memory module with a recursive transfer approach. The core of the transformer architecture is scaled dot-product attention: given a query matrix Q, key matrix K, and value matrix V, the attention output is computed as

Attention(Q, K, V) = Softmax(QK^T / √d_k, dim = 1) V

where Softmax(∗, dim = 1) denotes the application of Softmax over the second dimension of the input and d_k is the key dimensionality.
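For reference, a minimal sketch of this core computation (single head, no masking or memory module; the toy shapes are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head scaled dot-product attention."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # query-key similarities
    weights = torch.softmax(scores, dim=-1)            # normalize over key positions
    return weights @ V

Q, K, V = torch.randn(4, 768), torch.randn(6, 768), torch.randn(6, 768)
out = scaled_dot_product_attention(Q, K, V)            # -> (4, 768)
```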

1) DATASET FOR TRANSFORMER MODEL
We benchmark our proposed Fourier-based video description transformer exploiting spectral representations on the DeepRide dataset and validate our framework. Compared to other available description datasets [77], it is the most significant driving video description dataset, with 16,000 videos and an average of 68 words per description. The descriptions were generated for a large-scale dashcam video dataset, BDD100K [78], which contains 100k dashcam videos across various demographics, i.e., cities, weather, lighting conditions, buildings, traffic, trees, and a variety of other situations. The dataset is balanced and a good source for validating larger text description generation models. The DeepRide train-test split statistics are shown in Table 2.

2) TRAINING THE TRANSFORMER MODEL
We train the transformer model for 50 epochs with early stopping for best model selection. Early stopping patience is set to ten epochs. We employ the Adam optimizer with five warm-up epochs and a learning rate of 1e−4. The training approach uses greedy decoding, with CIDEr as the measure for model selection. Table 3 lists the simulation parameters adopted during training.
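A hedged sketch of this optimization setup follows; the linear warm-up schedule is our own assumption about how the five warm-up epochs are realized, and the stand-in model is a placeholder:

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)        # stand-in for the recurrent transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
warmup_epochs = 5
scheduler = torch.optim.lr_scheduler.LambdaLR(     # linear warm-up, then constant lr
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / warmup_epochs)
)
for epoch in range(50):
    # ... one training epoch with greedy decoding; track CIDEr for model selection ...
    scheduler.step()
```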

3) WORD EMBEDDINGS
We employ Global Vectors (GloVe) [79] for word embeddings, i.e., GloVe-6B with 300 dimensions. We generated a vocabulary index for the language model. The embedding model produces 300 features at its output layer, which are used as input to the transformer model.
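A minimal sketch of building such an embedding layer (assuming the standard glove.6B.300d.txt file is available locally; the toy vocabulary index is illustrative):

```python
import numpy as np
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "<unk>": 1, "car": 2, "road": 3}    # illustrative vocabulary index
matrix = np.zeros((len(vocab), 300), dtype=np.float32)   # filled from the GloVe file below
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        word, *vec = line.rstrip().split(" ")
        if word in vocab:
            matrix[vocab[word]] = np.asarray(vec, dtype=np.float32)

embedding = nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=False)
```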

4) EARLY STOPPING
Determining the number of epochs that should be run while training a deep learning model is significantly challenging. Insufficient epochs will likely delay the convergence of the model, whereas excessive epochs will result in the model being overfitted. By implementing early stopping, we can reduce overfitting without compromising model accuracy.
Early stopping, a regularization technique, ensures that training stops as soon as there are no further loss improvements. We set a patience of 10 epochs for early stopping and save only the best weights during the training process.
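A sketch of this early-stopping loop (the model, train_one_epoch, and evaluate routines are hypothetical placeholders):

```python
import torch

best_loss, patience, wait = float("inf"), 10, 0
for epoch in range(50):
    train_one_epoch(model)                 # hypothetical training routine
    val_loss = evaluate(model)             # hypothetical validation routine
    if val_loss < best_loss:               # improvement: reset patience, keep best weights
        best_loss, wait = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")
    else:
        wait += 1
        if wait >= patience:               # no improvement for 10 epochs: stop training
            break
```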

5) EVALUATION METRICS
We evaluate the proposed framework with well-known metrics for measuring generated language accuracy. For dense video captioning, we report Bilingual Evaluation Understudy (BLEU) [80], Consensus-based Image Description Evaluation (CIDEr) [81], Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [82], and Metric for Evaluation of Translation with Explicit ORdering (METEOR) [83]. The implementation of the standard evaluation metrics is sourced from the MSCOCO server.
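As an illustration, these metrics can be computed with the pycocoevalcap package, one common packaging of the MSCOCO evaluation code (a sketch; the toy captions are placeholders):

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

gts = {"vid1": ["a white car turns left at the intersection ."]}   # reference descriptions
res = {"vid1": ["the car is turning left at an intersection ."]}   # generated descriptions

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE_L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)
```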

IV. RESULTS AND DISCUSSION
We evaluate the proposed framework using standard evaluation metrics, i.e., METEOR, BLEU@4, CIDEr, and ROUGE. The proposed model outperformed the existing algorithms.
The following sub-sections demonstrate our experimental results with a valuable discussion of the achieved results.

A. PERFORMANCE EVALUATION
We evaluate the performance of the proposed framework on the DeepRide dataset and collect observations from the experiments. Figure 7 shows the BLEU@4 score computed during each validation step. The plot demonstrates a clear superiority of the proposed approach: the combination of spectral, spatial, and temporal features contributes to the score right from the beginning and consistently surpasses the other representation combinations. We performed experiments for all possible combinations, i.e., spatial, temporal, spectral, spatial + temporal, spectral + spatial, spectral + temporal, and spatial + spectral + temporal. The training performance plot shows that employing all three modes together performs best.
We adopted the base model provided in the DeepRide dataset paper and, to demonstrate the efficacy of our proposed algorithm, compared it with the Masked Transformer [41], Transformer-XL [29], and MART [9]. The performance evaluation in Figure 8 shows that spectral features are not a subset of either spatial or temporal features and hence contribute to video description performance as a third key source of video information.
Considering a single feature for video description, spectral features scoring lower than the standard spatial score indicates that spatial features are not a subset of spectral features and vice versa. Similarly, temporal features are not a subset or superset of spectral or spatial features. This finding is exciting and opens new doors to gathering features that may have been missed by the spatial convolution and optical flow (temporal) feature extractors. Figure 8 depicts all evaluated metrics as a bar chart. The plot demonstrates that spectral (F), spatial (S), and temporal (T) features perform best when fused; individual performance is considerable but not superior. Therefore, from our experiments, we can safely conclude that our proposed approach contributes positively to the video description framework. The proposed framework demonstrates better performance than the baseline work for generating dense video descriptions.
Since the spectral features capture the patterns of partial differential equations within the spatio-temporal representations of the video, they are not the same as the information captured by conventional approaches.

B. ABLATION ANALYSIS
We performed an ablation analysis considering the representations discussed in Section IV-A and collected detailed experimental results for all four evaluation metrics. We plot the validation performance for BLEU@4 for the various representation combinations in Figure 8 to analyze and demonstrate the collected results during the experiments. Detailed ablation analysis for all features combination is demonstrated in Figure 9.
As shown in Table 4, the proposed approach demonstrated superior performance over existing approaches; however, there are certain limitations to adding spectral feature extraction to the video description framework. Fourier-based neural networks are not available out of the box and are not mainstream models; therefore, we developed a custom model for the dense video description framework. We trained the feature extraction model, Fourier Net 3D (F3D), on UCF101, which is a comparatively small dataset for learning the most out of the videos. Training the model on the Kinetics dataset or other large datasets could yield even better and more unique features, which would help extract richer information from videos and generate richer, more diverse textual narrations in natural language.

C. QUALITATIVE ANALYSIS
In this paper, we introduced novel Fourier-based spectral representation learning and the fusion of these learned features with spatial and temporal features. Further, we developed a memory-augmented recurrent transformer architecture for accurate and diverse dashcam driving video description generation employing these learned representations. We evaluated our proposed Fourier-based video description transformer exploiting spectral representations on the DeepRide dataset. Figures 10 and 11 depict the qualitative results of the proposed video description framework on the DeepRide dataset, with ground-truth and generated dense video descriptions, i.e., a whole paragraph for a single dashcam video describing both the static scene (parked cars on the roadside, trees, signboards, high-rise buildings) and the dynamic content (turning vehicles, accident occurrence, lane changes, switching of traffic signals at intersections, and passing under or over bridges). Figures 10 and 11 show sample frames of dashcam videos from the DeepRide test set with ground-truth annotations and descriptions generated with the feature compositions spectral + spatial + temporal (F+S+T) and spatial + temporal (S+T). In the ground-truth descriptions, blue-colored sentences are not captured by either feature composition, whereas red-colored sentences are inaccurately described by both compositions. The green-colored sentences are those accurately captured and described by the (F+S+T) composition but not by the (S+T) composition, validating the efficacy of the proposed framework.
Furthermore, the performance of a single feature set, or of their fusion, can vary with the nature of the video description task, i.e., dashcam video-based trip description generation in this research. The temporal features extracted from the scene are informative for capturing dynamic aspects such as the vehicle's movement, pedestrians, and other road conditions. The same holds for spatial feature extraction, where better contextual information and object relationships can be learned from each frame of the dashcam video. Further concatenation of these two with spectral representations helps describe more visual aspects, resulting in boosted performance. The choice of feature composition depends on the nature of the task and the characteristics of the employed dataset. Moreover, ablation over multiple feature compositions can help select the most discriminative aspects of each modality and improve the performance of the employed model.

V. CONCLUSION
Several representation learning and fusion algorithms work well for videos and primarily utilize convolutional neural networks to extract and aggregate most visual and text features. These features commonly include spatial, temporal, and semantic representations. However, learning the PDE (partial differential equation) or nature-based qualities of videos and images is overlooked by standard convolutional procedures. This research studied the significance of natural physics-based features in the video data volume for video description generation by examining the patterns in the distribution of colors or intensities across different frequencies. Our experiments showed that even though CNNs extract almost all known visual features, adding physics-informed features enhances the overall representation learning process for video content. Hence, the fusion of spectral features with the already incorporated spatial and temporal features in the transformer architecture improves the quality and accuracy of the generated descriptions, as evidenced through quantitative and qualitative evaluation. It enhances the visual information fed to the transformer and boosts the text and video feature mappings, resulting in a state-of-the-art video-to-text architecture.