A Video Question Answering Model Based on Knowledge Distillation

Abstract: Video question answering (QA) is a cross-modal task that requires understanding the video content to answer questions. Current techniques address this challenge by employing stacked modules, such as attention mechanisms and graph convolutional networks. These methods reason about the semantics of video features and their interaction with text-based questions, yielding excellent results. However, these approaches often learn and fuse features representing different aspects of the video separately, neglecting the interaction among them and overlooking the latent complex correlations between the extracted features. Additionally, the stacking of modules introduces a large number of parameters, making model training more challenging. To address these issues, we propose a novel multimodal knowledge distillation method that leverages the strengths of knowledge distillation for model compression and feature enhancement. Specifically, the fused features in the larger teacher model are distilled into knowledge, which guides the learning of appearance and motion features in the smaller student model. By incorporating cross-modal information in the early stages, the appearance and motion features can discover their related and complementary relationships, thus improving the overall model performance. Despite the simplicity of the approach, extensive experiments on the widely used video QA datasets MSVD-QA and MSRVTT-QA demonstrate clear performance improvements over prior methods. These results validate the effectiveness of the proposed knowledge distillation approach.


Introduction
Video question answering (QA) [1] is a research task that evaluates a computer's ability to efficiently process video information through question answering. It is similar to earlier tasks, such as visual question answering (VQA) [2] and text question answering (TQA) [3]. Video QA has gained significant attention from researchers since its proposal. In this task, as shown in Figure 1, a video and a set of questions related to the video are provided, and the machine is expected to analyze the video content and understand the question content in order to provide accurate answers.
The video QA task presents difficulties that are not encountered in general QA tasks, placing higher demands on the model. Firstly, the questions in video QA are complex: answering them requires understanding how the question is framed, its purpose, and the part of the video on which it focuses. Secondly, video processing introduces temporal dynamics that are absent in static images. For example, in action-based questions such as "What are men doing?", understanding the actions often requires analyzing a sequence of frames rather than a single static image. The machine needs to observe information within each frame, identify targets, analyze
Although video QA has received extensive attention from the academic community compared with the earlier visual QA and text QA tasks, research on it still has many deficiencies. With respect to feature extraction, studies typically use pretrained models to extract appearance and motion features to represent the video, and word vectors to represent the text [1,4,6]. These features are then fed into two separate streams to obtain latent representations, which are finally fused into a visual representation. For the interaction and fusion between the video and the question text, recent works have mainly used attention [7,8] and graph convolution for feature enhancement and object-relational reasoning [5,6,9,10,11]. These methods can effectively extract the important frames and objects of interest in the video and reason under the guidance of the question. However, they lack interaction between the appearance and motion of the video and ignore the latent complex correlation between them.
This paper introduces a novel video QA model that addresses the limitations mentioned earlier.
The proposed model enhances the fusion between the appearance features and motion features while also compressing the overall model size using knowledge distillation techniques [12]. The approach starts by training a teacher model, which serves as a reference model. Based on the knowledge learned by the teacher model, a simpler student model is constructed and trained. This reduces the number of trainable parameters, shrinking the model while maintaining or improving its performance. Importantly, the proposed model focuses on capturing the latent complex correlations between the appearance and motion of the video, which strengthens the feature fusion process. The knowledge obtained from the multimodal fusion in the teacher model is distilled and used for unimodal learning in the student model. As a result, the student model can leverage rich multimodal information during unimodal training, and this early-stage multimodal interaction improves the later fusion of the appearance and motion modalities.
The main contributions of our work can be summarized as follows:
(1) Teacher-student framework: We introduce a teacher-student framework leveraging knowledge distillation techniques. This framework allows a simpler student model to be trained more conveniently and efficiently. By distilling the knowledge learned by the teacher model, the student model benefits from the teacher's expertise while maintaining a reduced model size.
(2) Multimodal knowledge distillation: We propose a novel approach to multimodal knowledge distillation. This technique enables the student model to acquire rich multimodal information while each individual modality is being trained. By incorporating multimodal interactions early on, the fusion of appearance and motion features is enhanced.
(3) Competitive results on MSVD-QA and MSRVTT-QA: Through extensive experiments, we demonstrate the effectiveness of the proposed model on the popular MSVD-QA and MSRVTT-QA datasets. Our model achieves competitive performance compared to existing approaches.

Related Work
The video QA task was introduced later than the general QA task and its development has been relatively slow due to the challenges in collecting video QA data and the complexity of video semantic analysis. However, with the continuous construction and improvement of datasets and the advancements in deep learning technology, research on video QA tasks has made significant progress. Various modeling methods have emerged, attracting significant attention from the academic community. The video QA datasets primarily fall into three categories: film and television, real-life, and generated datasets. Film and television datasets comprise video clips from movies and TV shows, including datasets such as MovieQA [13] and TVQA [14]. Real-life datasets consist of videos that capture daily life scenes, making them more applicable in practical scenarios. An example of such a dataset is LifeQA [15]. Generated datasets involve automatically generating videos with different virtual geometric objects. For instance, the SVQA dataset [16] contains videos generated using the Unity3D tool. Video QA datasets take various forms, including video retrieval, selection, filling in the blanks, and other methods. The answers in these datasets are generally predicted through classification techniques.
In recent years, research on the video QA task has been advancing steadily, and a common solution can be abstracted into a basic video QA framework. This framework comprises video feature extraction, question feature extraction, multimodal fusion, and final answer generation. Video feature extraction typically involves extracting static appearance features and dynamic motion features. For static appearance feature extraction, a common approach is to utilize pretrained models on ImageNet [17]. Network models such as VGG and ResNet [18] are commonly employed for this purpose. Dynamic motion features are typically extracted using pretrained models trained on the Kinetics dataset [19]. The C3D model [20] is a popular choice for extracting dynamic motion features. Subsequent research has sought to enhance the performance of the model by refining each module. Various improvements and optimizations have been explored by modifying the details of each component in the video QA framework.
For the extraction of textual features, pretrained word vectors are primarily utilized to encode each word as a fixed-length vector. Common techniques include Word2Vec and GloVe [21], often followed by a BiLSTM encoder. In the context of video QA tasks, research has focused on the interaction and fusion of video appearance features, video motion features, and question text features. Various implementation methods have emerged, such as attention mechanisms and graph convolution networks. Jang et al. [22] employed attention in both temporal and spatial dimensions to fuse video and question features, identifying crucial areas in key video frames and classifying the resulting features. Kim et al. [23] introduced memory mechanisms to enable the model to learn deeper representations and the meanings of features. Xu et al. [1] proposed an attention memory unit (AMU) based on dynamic memory network (DMN) principles, continually refining video feature attention through text-based cues. Gao et al. [4] considered the correlation between appearance and motion features, proposing the co-memory network method and utilizing dynamic attention to learn video features. Zhang et al. [7] explored convolutional approaches instead of recurrent neural networks and proposed hierarchical convolutional self-attention networks (HCSA), incorporating attention mechanisms at each stage to continuously focus on question-related information. While these methods primarily employ attention and memory mechanisms for representation learning, they often overlook object relationships and may have limitations in reasoning. Le et al. [24] introduced the conditional relation network (CRN) to model relationships between visual objects, but it may be less efficient in dealing with multiple object relationships.
With the emergence of graph convolution networks, Wang et al. [5] leveraged this approach to perform object-relationship reasoning in behavior recognition tasks, effectively improving the learning of objects and their relationships. Consequently, it has gained widespread use in video QA tasks. Jiang et al. [9] proposed heterogeneous graph alignment (HGA), employing the fused and aligned features of the question and video as graph nodes to perform graph convolution operations and infer relationship representations within and between modalities using an undirected heterogeneous graph. Huang et al. [10] proposed the location-aware graph convolution network (LGCN), which combines time embedding and position embedding and considers the interaction between objects in each frame.
Finally, when generating answers in video QA tasks, the common approach is to employ classification. Based on the probability distribution of each candidate answer calculated within a predefined set of possible answers, the cross-entropy loss function is utilized to compute the loss during training. During validation, the predicted answer is determined as the one with the highest probability.
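The classification-style answer generation described above can be illustrated with a minimal sketch. The answer vocabulary and the scores below are hypothetical values chosen for illustration, and the helper names are not taken from any cited model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the ground-truth answer."""
    return -math.log(probs[target_index])

def predict(probs, answers):
    """Return the candidate answer with the highest probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return answers[best]

answers = ["dog", "cat", "bear"]   # hypothetical predefined answer set
logits = [0.2, 1.5, 3.1]           # hypothetical decoder scores
probs = softmax(logits)
loss = cross_entropy(probs, target_index=2)   # used during training
prediction = predict(probs, answers)          # used during validation
```

The loss is only needed during training; at validation time the prediction is simply the highest-probability candidate.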

Materials and Methods
In the video QA task, the goal is to classify the correct answer to a given question by comprehending both the video content and the question itself. The answer choices are predefined and form a fixed set of possible answers. To tackle this task, our proposed approach utilizes a knowledge-distillation-based video QA model. This model acts as an answer classification model, taking multimodal features, including videos and questions, as its input.
This paper models the overall framework of the teacher model on DualVGR [6], as shown in Figure 2. On this foundation, to compress the model and to use the abundant multimodal knowledge of a larger model to enhance the feature learning of a smaller model, this paper introduces a multimodal knowledge distillation approach. The teacher model and the student model share the same structure and differ only in the number of graph layers. The teacher-student training structure of this approach is illustrated in Figure 3. The teacher model consists of two separate stacks of modules for the appearance and motion modalities, which are trained individually. Through experimental adjustment of the parameters, an optimal teacher model is obtained. Subsequently, the student model is constructed with fewer stacked modules, resulting in a simplified model. The knowledge distillation process transfers the fused visual features of the teacher model as "soft labels", which guide the student model's learning of the appearance and motion features, respectively.

Encoder
In the visual encoder module, capturing both the static spatial features and the dynamic temporal features of the video is essential due to the spatio-temporal nature of video. To extract the appearance features, we employ the pretrained model ResNet-101 and process each video clip using a BiLSTM. The resulting static appearance features are denoted as $V_a$. For the motion features, we utilize the pretrained model ResNeXt-101. The extracted dynamic motion features are denoted as $V_m$.
In the text encoder module, we aim to capture the word feature representation of the question sentence. To achieve this, we utilize the pretrained GloVe word vectors to encode each word in the question, obtaining the word features of the question, denoted as $Q_w$. Furthermore, to extract both the contextual features of words and the semantic features of the entire sentence, we employ two BiLSTMs. One BiLSTM extracts the embedded feature of each word, denoted as $Q_e$, while the other captures the semantic feature of the sentence, denoted as $q$.
Figure 2. The overall framework of the teacher model, which includes four modules. First, an encoder extracts and represents the video and question. Then, visual-text interaction performs the reasoning between the visual and textual features. The visual fusion module fuses the appearance and motion features into the visual representation. Finally, answer generation predicts the answer using a decoder.

Visual-Text Interaction
To emphasize the importance of certain words and downplay the significance of others in the word embedding of the text, the model incorporates an attention mechanism that assigns an attention score to each word. These scores are calculated from the embedded feature $Q_e$ with learnable parameters $W_1$ and $W_2$, and the word features $Q_w$ are then weighted by the scores and summed to obtain the overall representation of the question, denoted as $Q_{att}$.
A video contains multiple clips, and a question may pertain to only a subset of them. The model needs to focus on these specific clips to answer the question accurately, rather than on the entire video. To achieve this, question-guided attention is applied to each clip of the video to determine its attention score, which allows the model to prioritize certain clips while appropriately disregarding others. This module follows the same processing steps for both the appearance features (a) and the motion features (m) of the video, which are uniformly written as a/m. The attention score $S_{a/m}$ for each clip is computed with learnable parameters $W_{a/m}$ for the appearance and motion features.
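As a rough illustration of question-guided clip attention, the sketch below scores each clip against a question vector and normalizes the scores over clips. The dot-product scoring is an assumption made for illustration and may differ from the formula used in the paper, which also involves learnable parameters; `clip_attention` and `apply_attention` are hypothetical helper names.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def clip_attention(clips, question):
    """Score each clip against the question (assumed dot product) and normalize."""
    scores = [sum(c * q for c, q in zip(clip, question)) for clip in clips]
    return softmax(scores)

def apply_attention(clips, weights):
    """Scale each clip feature by its attention weight."""
    return [[w * x for x in clip] for clip, w in zip(clips, weights)]

# Three toy clip features and a toy question vector (illustrative values only).
clips = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [2.0, 0.0]
weights = clip_attention(clips, question)
attended = apply_attention(clips, weights)
```

Here the first clip aligns best with the question vector, so it receives the largest weight and the second clip the smallest.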
To capture the relationships between clips in the video and achieve a deeper feature representation, this approach draws inspiration from DualVGR [6], which combines the graph attention network (GAT) with the idea of AM-GCN (adaptive multi-channel graph convolutional network) for relational reasoning. GAT uses a multi-head graph convolution strategy and incorporates an attention mechanism to effectively represent the relationships between nodes. In the teacher model, multiple layers of graph convolution are used to fully exploit the information obtained through graph convolution. In contrast, the student model aims to compress the model and reduce the number of trainable parameters, so it employs only a single layer of graph convolution. Additionally, following the idea of AM-GCN, this method extracts not only the independent features of appearance and motion, but also the joint features that fuse motion information into appearance and vice versa. A loss function constraint emphasizes the distinction between independent features and joint features while encouraging similarity among joint features. This module processes appearance and motion features in the same manner, which is why they are collectively written as a/m.
Firstly, the visual feature $V_{a/m}$, serving as the input to the graph convolution network, undergoes multiple iterations of the graph convolution operation to obtain independent and joint features. Here, $G_{i-1}$ denotes the input features of the $i$-th graph convolution layer, and $X_i$ denotes its output features. The subscript $am$ denotes the joint feature obtained by fusing the motion information into the appearance, while the subscript $ma$ denotes the joint feature obtained by fusing the appearance information into the motion. $\mathrm{GCN}_i$ denotes the graph convolution operation at the $i$-th layer, where $i$ is less than or equal to $g$.
In each GCN layer, the input feature $G$ is processed by $k$ graph convolution heads, and the outputs of the heads are concatenated. First, a linear layer maps the input feature $G$ to dimension $d_k$; the mapped feature, denoted as the visual feature $g$, is passed to the graph convolution heads. In the $j$-th head, $g_j$ is multiplied by the visual attention score $S$ to obtain the attended visual feature $h_j$. In addition, to represent the relationships between video clips, $g$ is used to construct an undirected, fully connected graph, and an attention mechanism yields the weights $\beta_j$ of the relationships between each pair of nodes. Here, Graph denotes the operation of constructing the fully connected undirected graph, and $W$ is a learnable parameter. The feature $x_j$ of each clip is then obtained by aggregating over the clips according to these relationship weights. Finally, the features $x_j$ produced by the graph convolution heads are concatenated to give the output feature $X$ of the layer. With the above process, we obtain the independent feature $C_a$ and joint feature $C_{am}$ of appearance, and the independent feature $C_m$ and joint feature $C_{ma}$ of motion.
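The multi-head aggregation over a fully connected clip graph can be sketched as follows. This is a simplified illustration under stated assumptions: edge weights come from a softmax over pairwise dot products, each head works on a slice of the feature vector instead of a learned linear projection, and the clip attention score $S$ is omitted. The names `graph_head` and `multi_head` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def graph_head(nodes):
    """One head: attention-weighted aggregation over a fully connected graph."""
    out = []
    for u in nodes:
        # Edge weights from node u to every node, including itself (assumed dot product).
        beta = softmax([sum(a * b for a, b in zip(u, v)) for v in nodes])
        out.append([sum(beta[j] * nodes[j][d] for j in range(len(nodes)))
                    for d in range(len(u))])
    return out

def multi_head(nodes, k):
    """Run k heads on equal slices of each node feature and concatenate the results."""
    dim = len(nodes[0])
    assert dim % k == 0, "feature dimension must be divisible by the number of heads"
    step = dim // k
    heads = [graph_head([n[i * step:(i + 1) * step] for n in nodes]) for i in range(k)]
    return [[x for head in heads for x in head[idx]] for idx in range(len(nodes))]

# Three toy clip features of dimension 4, processed by 2 heads.
nodes = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head(nodes, k=2)
```

Because every output is a convex combination of the input clip features, stacking several such layers (as in the teacher) mixes information over more hops than a single layer (as in the student).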

Visual Fusion
In the process of fusing visual features, an attention mechanism is used to weight the importance of the independent features and the joint features. The fused appearance feature $F_a$ is given by
$$F_a = \alpha_a C_a + \alpha_{am} C_{am} \tag{11}$$
where the attention weights $\alpha_a$ and $\alpha_{am}$ are computed with learnable parameters $W_1$ and $W_2$. The fused motion feature $F_m$ is obtained in the same way. To address the issue of vanishing gradients during back-propagation, this method employs the residual connection strategy [18] to obtain the final visual feature $V_{a/m}$ that incorporates the relationship information.
Then, the visual appearance features and visual motion features are fused. This method utilizes multimodal factorized bilinear pooling [25] to obtain the fused visual feature of each video clip, denoted as $V_c$. Furthermore, drawing inspiration from the graph readout operation [26], the fused feature representation $V_{all}$ of the entire video is obtained.
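The attention-weighted fusion of Equation (11) and the residual connection can be illustrated with a small sketch. The scoring function (a tanh followed by a dot product with a vector `w`) and the additive form of the residual are assumptions made for illustration; the paper's learnable parameters $W_1$ and $W_2$ are not reproduced here.

```python
import math

def fuse(independent, joint, w):
    """Return alpha_ind * independent + alpha_joint * joint and the two weights."""
    # Assumed scoring: project tanh-activated features onto a vector w.
    s_ind = sum(wi * math.tanh(x) for wi, x in zip(w, independent))
    s_joint = sum(wi * math.tanh(x) for wi, x in zip(w, joint))
    # Two-way softmax over the scores.
    m = max(s_ind, s_joint)
    e_ind, e_joint = math.exp(s_ind - m), math.exp(s_joint - m)
    alpha_ind = e_ind / (e_ind + e_joint)
    alpha_joint = 1.0 - alpha_ind
    fused = [alpha_ind * a + alpha_joint * b for a, b in zip(independent, joint)]
    return fused, (alpha_ind, alpha_joint)

def residual(original, fused):
    """Assumed additive residual connection."""
    return [o + f for o, f in zip(original, fused)]

c_a = [1.0, 2.0]    # toy independent appearance feature
c_am = [3.0, 0.0]   # toy joint feature
w = [0.5, 0.5]      # hypothetical scoring vector
f_a, alphas = fuse(c_a, c_am, w)
v_a = residual([0.1, 0.1], f_a)
```

The two weights always sum to one, so each component of the fused feature lies between the corresponding components of the independent and joint features.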

Answer Generation
This module integrates the visual features with the semantic features of the question, decodes these features, and generates the final answer. The model employs a standard decoder to produce the final feature $p$ used for answer generation. The decoder uses learnable parameters $W_1$, $W_2$, and $W_3$, and $n$ denotes the number of answer categories.

Student Model
This method employs a multimodal knowledge distillation approach. It distills the knowledge obtained from multimodal fusion in the teacher model and utilizes it for unimodal learning in the student model. By doing so, the student model can leverage the rich multimodal knowledge of the teacher model during unimodal training, allowing for early-stage interaction and fusion between multimodal models. This approach aims to enhance the effectiveness of multimodal fusion in the later stages, leading to improved model performance. Moreover, this method not only reduces the model's complexity but also enhances the learning capabilities of multiple modal features.
As previously mentioned, the teacher model utilizes multi-layer graph convolution to iteratively process visual features. This approach enables the model to perform multistep reasoning on video relationships and extract relationship features effectively through multiple graph convolution layers. On the other hand, the student model, designed for compression purposes, employs only one graph convolution layer to process visual features, resulting in a smaller model size. During the training stage, this method leverages the knowledge obtained from multimodal fusion in the teacher model to guide the unimodal learning of the student model. This guidance aims to optimize the learning process and improve the overall performance of the student model. The training details of the student model can be found in Figure 3.
The method of knowledge distillation is based on the teacher-student framework. In this work, the teacher model trains its appearance and motion modal features separately using several stacked modules. Through experimentation and parameter adjustments, the optimal teacher model is obtained. Subsequently, the student model is constructed with fewer stacked modules, making it a simpler model. The knowledge distillation process transfers the fused visual features of the teacher model, referred to as "soft labels", to guide the learning of the appearance and motion features in the student model. More specifically, the fused feature $V_c$ of the teacher model is distilled into knowledge and used to guide the learning of the appearance feature $V_a$ and the motion feature $V_m$ in the student model. The details of the loss function calculation are described in the next section. This approach allows the student model to learn from the knowledge of the other modality during the unimodal processing of appearance or motion, enhancing the interaction and fusion between modalities. Additionally, this compensates for the reduced number of graph convolution iterations, ensuring sufficient information extraction.

Loss Function
In the vision-text interaction module of the model, the objective is to learn both the independent knowledge of each visual modality (appearance and motion) and the knowledge associated with the other modality. To achieve this, four features are generated in the graph convolution of each layer: the independent feature $C_a$ and joint feature $C_{am}$ for appearance, and the independent feature $C_m$ and joint feature $C_{ma}$ for motion. The model aims for a large difference between the independent and joint features, while keeping the difference between the two joint features small. To achieve this, the method draws inspiration from DualVGR and employs the Hilbert-Schmidt independence criterion (HSIC) [27] to constrain the differences between features at each layer, resulting in the matrix distance $L_1$. Additionally, a similarity constraint based on a matrix distance is applied to ensure similarity between the joint features, resulting in the matrix distance $L_2$. With these constraints, the model can extract both the information specific to each visual modality and the information associated with the other modality, enhancing the representation of multimodal interactions.
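For reference, the standard empirical estimator of HSIC between two feature matrices is given below. The choice of a linear kernel is an assumption made for illustration; how the estimator is applied to the independent and joint features and summed into $L_1$ follows the paper and DualVGR and is not reproduced here.

```latex
% Empirical HSIC between feature matrices X and Y, each with n rows (clips).
% K and L are Gram matrices of X and Y; a linear kernel is assumed here.
% H is the centering matrix.
\[
  \mathrm{HSIC}(X, Y) = \frac{1}{(n-1)^{2}}\,\operatorname{tr}\!\left(K H L H\right),
  \qquad
  K = X X^{\top},\quad L = Y Y^{\top},\quad
  H = I_n - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\top}.
\]
```

A value close to zero indicates that the two feature sets are close to independent under the chosen kernel, which is the property the independence constraint encourages between independent and joint features.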
The cross-entropy loss function is used to calculate the loss between the predicted probability distribution and the real label:
$$L_T = -\sum_{i=1}^{n} z_i \log y_i$$
where $y$ represents the predicted probability distribution, $z$ represents the real probability distribution, and $n$ is the number of answer categories.
To sum up, the loss function of the teacher model combines the cross-entropy loss $L_T$ with the constraint losses $L_1$ and $L_2$, where the coefficients $\gamma$ and $\eta$ weighting the constraint losses are hyperparameters that can be adjusted to optimize the model. The student model includes the losses of the teacher model, $L_T$, $L_1$, and $L_2$, and additionally a knowledge distillation loss for learning from the teacher. To learn the knowledge of the other modality in the early stage of unimodal learning, the fused feature $V_c$ of the teacher model is distilled into knowledge that guides the appearance feature $V_a$ and the motion feature $V_m$ of the student model, improving the interaction and integration between the modalities. First, the features are normalized with the Softmax function at an appropriate temperature $T$, and then the cross-entropy is used to calculate the loss. In the distillation loss, $L_0$ denotes the cross-entropy operation, $V^t_c$ is the fused feature of the teacher model, $V^s_{a/m}$ is the appearance or motion feature of the student model, and $T_{a/m}$ is the distillation temperature for appearance or motion. The temperature adjusts the distribution of the soft labels: the higher the temperature, the smoother the distribution. The weights $\lambda_a$ and $\lambda_m$ adjust the influence of the knowledge distillation losses on the total loss. The total loss of the student model therefore adds the two weighted distillation losses to the losses of the teacher model, where the coefficients $\gamma$ and $\eta$ are taken directly from the teacher model, and $\lambda_a$ and $\lambda_m$ are hyperparameters that are tuned to optimize the model.
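The temperature-scaled distillation loss and the total student loss can be sketched as follows. This is a minimal sketch under assumptions: the temperature is applied to both the teacher and the student features before the Softmax, $\gamma$ is taken to weight $L_1$ and $\eta$ to weight $L_2$, and all function names and numeric values are hypothetical.

```python
import math

def softmax_t(xs, temperature):
    """Softmax of xs / temperature; a higher temperature gives a smoother distribution."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_fused, student_feat, temperature):
    """Cross-entropy between softened teacher soft labels and the student distribution."""
    soft_labels = softmax_t(teacher_fused, temperature)
    student = softmax_t(student_feat, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_labels, student))

def student_total_loss(l_t, l_1, l_2, kd_a, kd_m, gamma, eta, lam_a, lam_m):
    """Assumed combination: teacher losses plus weighted distillation losses."""
    return l_t + gamma * l_1 + eta * l_2 + lam_a * kd_a + lam_m * kd_m

teacher_vc = [2.0, 1.0, 0.1]    # toy fused teacher feature
student_va = [1.5, 1.2, 0.3]    # toy student appearance feature
student_vm = [0.2, 1.8, 0.9]    # toy student motion feature
kd_a = distill_loss(teacher_vc, student_va, temperature=1.0)
kd_m = distill_loss(teacher_vc, student_vm, temperature=0.7)
total = student_total_loss(1.0, 0.5, 2.0, kd_a, kd_m,
                           gamma=100, eta=1e-6, lam_a=1, lam_m=100)
```

For a fixed teacher distribution, the cross-entropy is smallest when the student distribution matches it, so minimizing this term pulls each unimodal student feature toward the teacher's fused feature.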

Dataset
The experiments use two widely used video QA datasets, MSVD-QA [28] and MSRVTT-QA [29].

Teacher Model
In the video preprocessing stage, each video is divided into a fixed number of equally spaced clips. For the MSVD-QA dataset, the number of clips is set to 8, while for the MSRVTT-QA dataset, it is set to 16. Each clip consists of a certain number of image frames, where the number of frames per clip is 16. If a clip contains fewer than 8 frames at either the beginning or the end, it is padded with the first frame or the last frame to reach the required length. The dimension size of the model, denoted as d, is set to 768. This represents the size of the visual and textual feature vectors used in the model. In the encoder module, a bidirectional LSTM (BiLSTM) is employed for both visual coding and text coding. The BiLSTM is configured with a single layer. In the vision-text interaction module, the number of layers of graph convolution, denoted as g, is set to 4 for the MSVD-QA dataset and 6 for the MSRVTT-QA dataset. Additionally, the number of graph convolution heads, denoted as k, is set to 4. In the visual fusion module, the number of factors, denoted as f , is set to 4 for the multimodal factorized bilinear pooling.
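The clip sampling step can be sketched as follows. The sketch assumes that each clip is a 16-frame window centered on one of the equally spaced positions, and that frame indices running past either end of the video are clamped, which repeats the first or last frame; `clip_indices` is a hypothetical helper and the exact procedure in the paper may differ.

```python
def clip_indices(num_frames, num_clips, clip_len=16):
    """Return frame indices for equally spaced clips, clamping at the video edges."""
    half = clip_len // 2
    # Equally spaced clip centers across the video.
    centers = [int((i + 0.5) * num_frames / num_clips) for i in range(num_clips)]
    clips = []
    for c in centers:
        # Clamping repeats the first or last frame when the window leaves the video.
        clips.append([min(max(f, 0), num_frames - 1)
                      for f in range(c - half, c + half)])
    return clips

# A toy 40-frame video split into 8 clips, matching the MSVD-QA clip count.
clips = clip_indices(num_frames=40, num_clips=8)
```

With 40 frames and 8 clips, the first clip starts before frame 0, so its leading indices are all 0, and the last clip is padded with frame 39 in the same way.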
In the loss function, the constraint loss coefficients $\gamma$ and $\eta$ of the independent and joint features are set to 100 and $1 \times 10^{-6}$, respectively. The model is trained with the Adam optimizer, the learning rate is set to $1 \times 10^{-4}$, the batch size is 256, and the number of training iterations is 25.

Student Model
In the student model, the parameter configuration is kept the same as the teacher model, except for the number of layers in the graph convolution. Specifically, the student model utilizes a single-layer graph convolution operation to process the visual features, while maintaining the same parameter settings as the teacher model. By using only one layer of graph convolution, the student model achieves model compression by reducing the network size and complexity compared to the teacher model.
In the knowledge distillation loss function, the parameters for distilling the visual appearance and motion features are determined through a grid search for a suitable configuration. For the distillation of visual appearance features, the temperature parameter $T_a$ is set to 1, which leaves the Softmax distribution unscaled, and the coefficient $\lambda_a$ is set to 1. For the distillation of visual motion features, the temperature parameter $T_m$ is set to 0.7 for MSVD-QA and 1 for MSRVTT-QA, and the coefficient $\lambda_m$ is set to 100 for MSVD-QA and 1 for MSRVTT-QA.

Visual Analysis
Analyzing the changes in accuracy and loss during the training process can provide insights into the model's learning dynamics and performance. By visualizing these metrics, we can observe how the model progresses over time and identify any potential issues or improvements.
The loss changes of the training set and the validation set during the training process in MSVD-QA are shown in Figures 4 and 5, which, respectively, represent the real label loss, the appearance feature distillation loss and the motion feature distillation loss. The continuous decrease and convergence of the loss values for both the training set and the validation set indicate that the training process is effective. This means that the model is learning and making progress in fitting the data. Furthermore, the decrease in the distillation loss of the appearance and motion features indicates that the knowledge distillation process is effective in transferring the knowledge from the teacher model to the student model. Then, in order to further analyze the prediction performance of the model, some examples of incorrect prediction are selected from the prediction results of the model, as shown in Figure 6. The shortcomings and improvements of the model in the video QA task can be analyzed based on the errors in the model predictions. In question 1, due to the limited number of words in the word list, some words need to be represented by '<UNK>', which leads to the model being unable to understand the semantics of some words, and, thus, unable to accurately predict the answers. In question 2, the model incorrectly predicting "rabbit" instead of "bear" suggests a limitation in target recognition.
The similarity between the brown background and the color of the bear might have misled the model. In question 3, the model providing a verb "chase" instead of identifying the target object indicates a deficiency in sentence semantics understanding. Regarding the errors in question 4 and question 5, where the model's predicted answers may align with human understanding but do not match the fixed answers in the dataset, this could be attributed to the limitations of the dataset itself.

Comparative Analysis
To analyze and verify the effectiveness of the proposed method on video QA tasks, we compare it with state-of-the-art video QA models. These models are briefly introduced below, and the experimental results are then compared and analyzed.
• Co-Mem [4]. This method is developed from the dynamic memory network (DMN) used in visual QA and is adapted to video QA. In the context memory module, an appearance-motion co-memory attention mechanism is introduced, and a temporal convolution-deconvolution network together with a dynamic fact integration method is used to mine video information in depth.
• AMU [1]. This algorithm is an end-to-end video QA model that applies the fine-grained features of the question to video understanding. It reads the question word by word, interacts with the appearance and motion features through the attention mechanism, continually refines the video attention features, and finally obtains a video representation that integrates question features at different scales.
• HGA [9]. A graph network is introduced into the model for reasoning. It represents the video clips and the question words as a graph and carries out cross-modal graph reasoning.
• HCRN [24]. This is a stackable model of clip-based relational network modules. The relational network takes a set of tensor objects and a conditional feature as input and outputs a set of relational information about them; multi-step reasoning over this relational information is then realized by stacking the network modules hierarchically.
• DSAVS [8]. The answer to a question may be deduced from a few frames or fragments of the video, and the appearance and motion information are generally complementary. To this end, the authors propose a dynamic self-attention network with visual synchronization, which first selects important video clips and then synchronizes the various features in time.
• DualVGR [6]. This model is a stack of attention graph inference network modules. In each module, a query punishment mechanism is used to strengthen the features of key video clips, and the relationships are then modeled by a multi-head graph network combined with attention. The model performs multi-step reasoning over the relationship information by stacking these modules.
Tables 1 and 2 summarize the experimental results of each model on the MSVD-QA and MSRVTT-QA datasets, where our method achieves competitive results. The proposed method surpasses the DualVGR model in accuracy and improves on the other compared models on both datasets. This suggests that knowledge distillation is an effective technique for reducing model complexity while enhancing cross-modal information transmission and fusion, thereby improving the feature extraction capability of each individual modality. By distilling knowledge from the teacher model to guide the learning of the student model, the proposed approach achieves both model compression and improved accuracy on video question answering tasks.

Ablation Study
The proposed method compresses the model through knowledge distillation and strengthens cross-modal feature learning and fusion in order to improve performance. To verify the effectiveness of knowledge distillation, we conducted an ablation study.
In MSVD-QA, the constructed 'Teacher' model has about 31.19 million learnable parameters. We train the teacher model and adjust its parameters to reach its best configuration, which achieves an accuracy of 39.03% on the test set. We then construct a simpler student model by reducing the number of graph convolution layers, which lowers the number of learnable parameters to about 24.09 million. With all other settings identical to those of the teacher model, we first train the student model on its own, denoted 'Student'. We then train the student model under the guidance of the teacher model through multimodal knowledge distillation, tuning the distillation temperature and weight; this model is denoted 'Student-kd'. Table 3 shows the results of the ablation study on MSVD-QA, and Table 4 shows that the same pattern holds on MSRVTT-QA. Trained on its own, the student model is weaker at representing visual semantics because it has fewer learnable parameters. Guided by the teacher model, however, it achieves excellent accuracy, even higher than that of the teacher model. 'Student' and 'Student-kd' have the same architecture and the same number of trainable parameters, yet 'Student-kd', which benefits from knowledge distillation and the cross-modal features learned by the teacher model, is more accurate. This suggests that knowledge distillation effectively improves inter-modal fusion. Before the fusion module, each individual modality, such as appearance, can already acquire multimodal knowledge, which inclines it toward the fusion distribution in advance. This helps avoid unstable fusion of the appearance and motion features.
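The roles of the distillation temperature and weight can be sketched as follows. This is a minimal illustration that assumes a standard soft-target formulation (temperature-scaled softmax with KL divergence) and a simple convex weighting between the real label loss and the two distillation losses. The paper's exact loss may differ, and all function names, the `alpha` weighting scheme, and the sample vectors are hypothetical.

```python
import math

def softmax(values, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_feat, student_feat, temperature):
    """Soft-target loss between teacher and student, scaled by T^2 (assumed form)."""
    p_teacher = softmax(teacher_feat, temperature)
    p_student = softmax(student_feat, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

def cross_entropy(logits, label):
    """Real label loss for a single example."""
    return -math.log(softmax(logits)[label])

def total_loss(answer_logits, label, teacher_fused, student_app, student_mot,
               temperature, alpha):
    """Combine the real label loss with the appearance and motion distillation losses.

    alpha is an assumed distillation weight: alpha = 0 ignores the teacher.
    """
    hard = cross_entropy(answer_logits, label)
    soft_app = distillation_loss(teacher_fused, student_app, temperature)
    soft_mot = distillation_loss(teacher_fused, student_mot, temperature)
    return (1 - alpha) * hard + alpha * (soft_app + soft_mot)

# Hypothetical toy vectors standing in for fused, appearance, and motion features.
teacher_fused = [2.0, 0.5, -1.0]
student_app = [1.0, 0.2, -0.5]
student_mot = [0.3, 0.1, 0.0]
answer_logits = [0.2, 1.5, -0.3]
label = 1

loss = total_loss(answer_logits, label, teacher_fused, student_app, student_mot,
                  temperature=2.0, alpha=0.5)
print(f"total loss: {loss:.4f}")
```

Under these assumptions, the temperature controls how much of the teacher's distribution beyond its top entry is exposed to the student, and the weight balances fitting the ground-truth answers against matching the teacher's fused representation. Both are the kind of settings that the ablation tunes for 'Student-kd'.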
Furthermore, comparing the test results of 'Teacher' and 'Student-kd' shows that knowledge distillation reduces the model size, from about 31.19 million to about 24.09 million trainable parameters on MSVD-QA, while maintaining or slightly increasing the accuracy. The prediction differences among the three models are illustrated in Figure 7. For easy questions, all three models predict correctly. For more challenging questions, the student model, with its limited number of learnable parameters, struggles to arrive at the correct answer, and the teacher's knowledge effectively guides the student toward the right choice. Notably, for exceptionally difficult questions that even the teacher model struggles with, the student model, equipped with rich multimodal knowledge, surpasses the teacher and predicts accurately.
These results highlight that, in this approach, knowledge distillation not only reduces the model size but also enhances the feature fusion between modalities, leading to improved performance and feature enhancement.
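As a quick check on the size reduction reported for MSVD-QA, the following sketch computes the relative parameter savings from the two approximate counts given above (about 31.19 million for the teacher and about 24.09 million for the student). The variable names are illustrative only.

```python
# Approximate learnable parameter counts reported for MSVD-QA, in millions.
teacher_params_m = 31.19
student_params_m = 24.09

removed_m = teacher_params_m - student_params_m      # parameters saved
reduction = removed_m / teacher_params_m              # fraction of the teacher removed
compression_ratio = teacher_params_m / student_params_m

print(f"parameters removed: {removed_m:.2f}M")
print(f"relative reduction: {reduction:.1%}")
print(f"compression ratio:  {compression_ratio:.2f}x")
```

The student model is therefore roughly 23% smaller than the teacher in terms of learnable parameters, and the distilled 'Student-kd' model keeps this size while matching or slightly exceeding the teacher's accuracy.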

Conclusions
This paper introduces a novel video question answering model that uses knowledge distillation to capture the latent complex correlation between appearance and motion in videos. While existing methods, such as attention mechanisms and graph convolutional networks, strengthen attention over visual and text-specific features and reasoning about video relationships, they often overlook the interaction between appearance and motion. To overcome this limitation, our approach leverages knowledge distillation to uncover the latent correlation between the static appearance features and the dynamic motion features in videos. By distilling knowledge from a teacher model, our method strengthens the fusion of appearance and motion features while compressing the model. The student model learns from the rich multimodal knowledge of the teacher model, which improves the interaction and fusion between appearance and motion features. The ablation experiments and comparisons in our study validate the effectiveness of our approach in visual feature learning and its ability to enhance video QA performance. However, the proposed knowledge distillation method currently applies only to the fusion of appearance and motion features in videos. In future work, we plan to explore the application of knowledge distillation to other types of video features beyond appearance and motion.

Conflicts of Interest:
The authors declare no conflict of interest.