MLSFF: Multi-level structural features fusion for multi-modal knowledge graph completion

Abstract: With the rise of multi-modal methods, multi-modal knowledge graphs have become a better choice for storing human knowledge. However, knowledge graphs often suffer from incompleteness due to the infinite and constantly updating nature of knowledge, and thus the task of knowledge graph completion has been proposed. Existing multi-modal knowledge graph completion methods mostly rely on either embedding-based representations or graph neural networks, and there is still room for improvement in terms of interpretability and the ability to handle multi-hop tasks. Therefore, we propose a new method for multi-modal knowledge graph completion. Our method aims to learn multi-level graph structural features to fully explore hidden relationships within the knowledge graph and to improve reasoning accuracy. Specifically, we first use a Transformer architecture to separately learn data representations for the image and text modalities. Then, with the help of multimodal gating units, we filter out irrelevant information and perform feature fusion to obtain a unified encoding of knowledge representations. Furthermore, we extract multi-level path features using a width-adjustable sliding window and learn structural feature information in the knowledge graph using graph convolutional operations. Finally, we use a scoring function to evaluate the probability of the truthfulness of encoded triplets and to complete the prediction task. To demonstrate the effectiveness of the model, we conduct experiments on two publicly available datasets, FB15K-237-IMG and WN18-IMG, and achieve improvements of 1.8% and 0.7%, respectively, in the Hits@1 metric.


Introduction
The continuous development of deep learning technology has had a significant impact on research in various fields. For instance, in the field of biomedicine, automatic diagnostic techniques based on deep learning have emerged, enabling image recognition and assisting healthcare professionals in diagnosis and subsequent procedures [1][2][3][4][5]. Furthermore, deep learning has demonstrated superior performance in scenarios with larger datasets, such as multi-view clustering [6][7][8][9][10]. To store and learn from vast amounts of information, knowledge graphs (KGs) have emerged.
A knowledge graph can be conceptualized as a large-scale semantic integration network, which represents entities as nodes and relationships as directed edges; thus, it stores a vast amount of human knowledge in the form of a directed graph. The resource description framework (RDF) provides a standard framework for KG representation, wherein fact triples (head, relationship, tail) are employed to describe knowledge [11]. A KG can store rich information about real-world entities and their relationships and can support a range of reasoning processes across the graph. Compared to traditional structured data, the graph-based approach to data processing has demonstrated superior performance in tasks such as information retrieval, question answering, and recommendation [12,13]. However, because real-world knowledge is infinite and constantly evolving, KGs are incomplete, which has given rise to the task of knowledge graph completion (KGC).
In the field of natural language processing (NLP), KGC techniques can be broadly categorized into three types: rule-based models, path-based models, and embedding-based models. Rule-based models tend to retain the original semantic information more completely, and therefore offer better interpretability. Path-based models make better use of the graph structure, enabling guided reasoning through various path-searching mechanisms. Both of these approaches are more interpretable, though their expressiveness is limited by model constraints, and their time and space complexity is higher. Compared to the first two types of models, embedding-based models typically offer greater expressiveness. With the development of graph neural networks (GNNs), GNN-based models have shown great potential in various graph-based tasks, providing additional ideas for KGC. In recent years, KGs have also been studied in computer vision, for example in the context of scene graphs and language-image integration.
In recent years, multi-modal knowledge graphs (MKGs) have gained significant attention as an extension of traditional knowledge graphs based on a single modality. MKGs typically augment semantic KGs with additional modality data, such as visual and audio attributes, to provide more physically rich representations of the world [14][15][16], as illustrated in Figure 1. For a given entity in the knowledge graph, we can use both image and text descriptions to supply detailed information that cannot be captured by the graph structure alone. Unfortunately, due to the lack of accumulated multi-modal corpora, existing MKGs often suffer from more severe incompleteness than traditional KGs, which greatly reduces their utility and effectiveness. In the task of multi-modal knowledge graph completion (MKGC), we must consider both multi-modal information fusion and the accuracy and interpretability of knowledge graph completion. In terms of multi-modal information fusion, we need to address issues such as semantic alignment, noise reduction or attenuation, and the realization of unified embeddings. In the process of link prediction, we must not only leverage the semantic richness of multi-modal information to improve accuracy but also strengthen the logic of the algorithm and improve its interpretability [17,18].

Figure 1. Example of a multi-modal knowledge graph. Entity label shown: Los Angeles. Text description shown: "Pulp Fiction is a 1994 American crime film directed by Quentin Tarantino, who also co-wrote the screenplay along with Roger Avary. The film is known for its eclectic dialogue, ironic mix of humor and violence, nonlinear storyline, and a host of cinematic allusions and pop culture references…"

Despite the abundance of existing image-text embedding pre-training models, these models often focus on a single pair of corresponding images and text and fail to consider the distinctive structural features of KGs. Therefore, our research builds upon MKGs that contain image-text feature information. In addition to integrating embeddings from different modalities, we also retain local graph features and introduce path features to enhance the interpretability of the reasoning model. Specifically, we propose a method that first utilizes separate modality encoders to learn image and text embeddings, followed by an irrelevance filtering layer that further selects semantically relevant key features. Next, we fuse and encode information from the different modalities to obtain a multi-modal representation. We then use graph convolution and path features to extract structural features, and use a scoring function to predict missing triples. Our contributions can be summarized as follows: 1) We design a structure that extracts image-text information through single-modality encoding followed by interaction fusion, and improve semantic similarity through an irrelevance filtering module, thereby enhancing the fused understanding of the different modalities; 2) We propose a structural feature learning scheme that combines graph convolution and path embedding, thereby enhancing interpretability during the reasoning process; 3) We achieve better results on two public datasets, FB15K-237-IMG and WN18-IMG.

Knowledge graph completion
The task of knowledge graph completion has been widely studied, with typical sub-tasks including link prediction, entity prediction, and relation prediction, aimed at predicting missing triples (head, relation, tail) in the knowledge graph. Rule-based models such as AMIE and RLvLR utilize symbolic features to perform reasoning through either rule mining or rule searching algorithms [19,20]. NeuralLP introduced dynamic programming and further optimized rule mining through attention mechanisms and auxiliary memory [21]. Path-based models focus more on the paths between queried head and tail entities, and algorithms such as the path ranking algorithm (PRA) and random walks have been applied and further explored in such models. RNNPRA uses recurrent neural networks (RNNs) to better learn path features for reasoning tasks [22]. DIVA proposed a unified reasoning framework that divides multi-hop reasoning into path search and path inference steps [23]. The continuous development of deep reinforcement learning (DRL) techniques has enabled more effective multi-hop reasoning in sparse graphs. A series of models such as DeepPath and MultiHop have achieved more effective path exploration by designing new reward mechanisms [24,25].
Currently, the more mainstream methods for solving KGC problems are embedding-based models. Translation-based models such as TransE, TransR, and TransH embed entities and their relations by projection, and use a distance function to score the factual triplets [26][27][28]. Tensor factorization models such as RESCAL, Tucker, and LowFER use vectors to capture latent semantics through tensor decomposition and continuously improve model efficiency while reducing model size [29][30][31]. With the continuous improvement of neural networks (NNs) in learning and expressing knowledge, additional embedding-based models use neural network architectures to implement KGC. NTN uses neural tensor networks for relation reasoning in KGs [32]. ConvE learns deeper features using two-dimensional convolutional layers [33]. InteractE processes more complex semantic information and KG interactions through multiple operations such as feature reshaping, feature permutation, and recurrent convolution [34]. Although CNN-based KGR models generally perform better than traditional NN models, the feature information contained in the graph structure itself has not been well utilized. Therefore, GNNs have been introduced into the KGC field to perform more complex reasoning tasks based on graph structure features. RGCN encodes each entity into a vector, uses specific transformations to aggregate neighborhood information for different relationship categories, and then reproduces facts through a decoder [35]. SACN uses weighted graph convolutional networks (WGCN) to implement the encoder, and then inputs the encoded information into a convolutional network for decoding [36]. NBF-Net and RED-GNN improve on traditional algorithms, adopting the Bellman-Ford algorithm and dynamic programming to optimize the propagation strategy of previous GNN models, and achieve efficiency improvements [37,38].

Multi-modal task
The traditional tasks in the two major fields of computer vision (CV) and natural language processing (NLP) have been extensively discussed, and more recent research has focused on cross-modal problems. The optimization and development of the Transformer model has led to a series of explorations into visual-text pre-training frameworks. VisualBERT is considered to be the first image-text pre-training model; it uses Faster R-CNN to extract visual features and concatenates them with text embeddings, which are then input into a Transformer initialized by BERT [39]. Inspired by the feature extraction and architecture of VisualBERT, more pre-training models have been proposed by adjusting the pre-training tasks and datasets. CLIP uses a dataset of 400 million image-text pairs for pre-training, learning representations by directly matching raw text and corresponding images [40]. METER further explores single-modal feature extraction and processes multi-modal fusion using a dual-stream architecture, achieving excellent performance on many downstream tasks [41].
Numerous excellent multi-modal pretraining models have adopted masked language modeling (MLM), masked visual modeling (MVM), and visual-linguistic matching (VLM) as pretraining objectives; their downstream tasks mainly concern the meaning of, and relationships between, text and images, such as visual question answering (VQA), visual commonsense reasoning (VCR), and visual captioning (VC). However, what distinguishes KGs from other semantically structured information is their graph structure. Recently, some studies have recognized the importance of structural features for handling KG-related tasks. DRAGON proposes a deep bidirectional, self-supervised pretraining method for language-knowledge models from text and KGs [42]. Knowledge-CLIP takes entities and relations in KGs as inputs and extracts the original features of these entities and relations [43]. Entities can be in the form of images or text, while relations are described using language tokens. These pretraining models with structural features provide better options for MKG-related tasks.

Multi-modal knowledge graph completion
As an emerging research field, related work in MKGC is not yet systematic, and early MKGC approaches often directly added image information to the input of the original KGR model, which usually led to suboptimal performance. To address this issue, many studies have explored image-text feature fusion in MKGs.
IKRL first proposed an attention-based neural network to consider visual information in entity images [44]. TransAE introduced a KG representation learning method that integrates multi-channel (visual and language) information in a translation-based framework, and extended the definition of triple energy to consider new multi-channel representations [45]. MKBE and MRCGN integrated different neural encoders and decoders with relation models to embed learning and multi-modal data for inference [14,46]. MarT constructed a multi-channel analogical reasoning framework based on structural mapping theory to improve model interpretability [47]. MMKGR used a unified gate attention network to perform attention interaction and filter noise, generating more effective and reliable multi-modal complementary feature encodings, and designed a new reinforcement learning framework to predict missing elements in multi-hop reasoning processes [16]. MM-RNS proposed a multi-channel relation-enhanced negative sampling framework that provides bidirectional attention between visual and text features by integrating relation embeddings, and combined it with contrastive learning to construct an effective contrastive semantic sampler to improve MKGC performance [48].
We give a brief overview of the related models for traditional and multi-modal KGs in Table 1. To demonstrate the effectiveness of the aforementioned work more clearly, we provide a more detailed comparative analysis of selected algorithms in Table 2.

Problem formulation
The knowledge graph $G = \{E, R, F\}$ is a directed graph, where $E$ is the entity set, $R$ is the relation set, and $F = \{(h, r, t) \mid h \in E, t \in E, r \in R\}$ is the fact set consisting of fact triples $(h, r, t)$. The head entity $h \in E$ and tail entity $t \in E$ are connected by a relation $r \in R$. For a multi-modal knowledge graph $G$, each entity $e$ includes two modalities, namely textual information $e_t$ and visual information $e_v$.
The purpose of multi-modal KGC is to infer the incomplete triplets $T = \{(h, r, t) \mid h \in E, t \in E, r \in R, (h, r, t) \notin F\}$ based on known fact triplets $(h, r, t)$. In practice, the incomplete triplets that may appear in our prediction task can take three forms, namely $(h, r, ?)$, $(h, ?, t)$, and $(?, r, t)$. In the implementation, we input the feature information of entities $e$ and relationships $r$ into an encoder to obtain the corresponding embedding vectors $\mathbf{h}, \mathbf{r}, \mathbf{t}$. Then, we use a scoring function $f(h, r, t)$ to evaluate the probability of the truthfulness of inferred triplets. That is, when $(h, r, t) \in G$ holds, $f(h, r, t)$ scores 1; otherwise, when $(h, r, t) \notin G$, $f(h, r, t)$ scores 0. Taking a missing triplet of the form $(h, ?, t)$ as an example, let us assume the existence of a relationship $r_{pd}$ between the head entity $h$ and the tail entity $t$, thereby obtaining the complete triplet $(h, r_{pd}, t)$ with unknown truthfulness. To evaluate the probability of its actual occurrence, we employ the scoring function, resulting in the output $f(h, r_{pd}, t)$. The basic terminology definitions are shown in Table 3.
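To make the three query forms concrete, the following is a minimal sketch that stores a toy KG as a set of triples and enumerates the candidate completions a scoring function would then rank. The class `Triple`, the helper names, and the relation names (`directed_by`, `filmed_in`) are illustrative assumptions, not part of the proposed model.

```python
from typing import List, NamedTuple, Optional, Set


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def query_form(h: Optional[str], r: Optional[str], t: Optional[str]) -> str:
    """Return which of the three incomplete forms a query takes."""
    missing = [name for name, value in (("h", h), ("r", r), ("t", t)) if value is None]
    if len(missing) != 1:
        raise ValueError("exactly one of h, r, t must be missing")
    return {"t": "(h, r, ?)", "r": "(h, ?, t)", "h": "(?, r, t)"}[missing[0]]


def candidate_triples(
    facts: Set[Triple],
    entities: Set[str],
    relations: Set[str],
    h: Optional[str],
    r: Optional[str],
    t: Optional[str],
) -> List[Triple]:
    """Enumerate completions of the query that are not already known facts."""
    form = query_form(h, r, t)
    if form == "(h, r, ?)":
        pool = [Triple(h, r, e) for e in sorted(entities)]
    elif form == "(h, ?, t)":
        pool = [Triple(h, rel, t) for rel in sorted(relations)]
    else:
        pool = [Triple(e, r, t) for e in sorted(entities)]
    # Only triples outside the fact set F need a truthfulness score.
    return [c for c in pool if c not in facts]


# Toy graph; the relation names are hypothetical.
entities = {"Pulp Fiction", "Quentin Tarantino", "Los Angeles"}
relations = {"directed_by", "filmed_in"}
facts = {Triple("Pulp Fiction", "directed_by", "Quentin Tarantino")}

for cand in candidate_triples(facts, entities, relations, "Pulp Fiction", None, "Los Angeles"):
    print(cand)
```

Each enumerated candidate corresponds to a triple such as $(h, r_{pd}, t)$ whose score $f(h, r_{pd}, t)$ decides whether it is accepted.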

Methodology
The overall architecture of the proposed model, MLSFF, is shown in Figure 2. It consists of three components: 1) single-modality encoders for image and text embedding; 2) a multi-modal feature fusion mechanism with irrelevance filtering, which discards interfering information and reduces noise when the image and text features interact; 3) a reasoning framework that combines the graph structure and path features, introduces a new scoring function containing multi-hop path features, and uses multi-modal features to predict incomplete triplets in the KGC process.

Single modal encoder
The emergence of the Transformer model revolutionized the NLP field, and the architecture is now widely used in various tasks. Introducing the Transformer into the CV field has also proved highly successful. In particular, when the pre-training data is large enough, the Transformer performs significantly better than CNNs in CV, overcoming the limitation of having few inductive biases and achieving better transfer to downstream tasks. We use independent image and text encoders based on the Transformer architecture to extract features from the raw inputs. For a given triple, the entity and relation are sent to the corresponding encoder based on their modality (image or text). The relation, represented by language tokens, is sent to the text encoder in the same way as a text entity. The main architecture of our single-modality encoder is illustrated in Figure 3.

Visual Encoder. For image feature extraction, we adopt the embedding layer and Transformer encoder of the pre-trained model ViT as the main architecture [49]. Let $C$ be the number of channels in the image (for RGB images, $C = 3$) and let the resolution of each image patch be $(P, P)$. First, we scale the input image $I$ to a unified resolution $(A, B)$, and then divide it into $N = AB/P^2$ patches. We use a linear mapping (i.e., an FC layer) to transform each patch into a one-dimensional vector. This completes the embedding of the original image, $X^v_{pat}$. Subsequently, we feed the image embedding together with the position embedding $X^v_{pos}$ into the Transformer encoder as input. The forward computation alternates between two kinds of blocks. The MSA block consists of a multi-head attention mechanism, a layer normalization, and a skip connection (Layer Norm & Add); it is repeated $L_v$ times, and the output of the $l$-th block is $\bar{X}^v_l$. The MLP block consists of a feedforward neural network, a layer normalization, and a skip connection (Layer Norm & Add); it is also repeated $L_v$ times, and the output of the $l$-th block is $X^v_l$.

Textual Encoder. In NLP tasks, a large number of pre-training models based on the Transformer architecture have emerged, such as BERT, which has been widely applied and has demonstrated great success in various downstream tasks [50,51]. In this paper, we use BERT to perform language modeling and feature extraction. Specifically, we divide the complete sentence into a word sequence and perform word embedding to obtain the word embeddings $X^t_{word}$. To preserve sentence-level features, we also embed the entire sentence and align it with the word embeddings to obtain the sentence embeddings $X^t_{sen}$. Then, we send the word embeddings $X^t_{word}$, position embeddings $X^t_{pos}$, and sentence embeddings $X^t_{sen}$ to the encoder.
The difference between text encoding and visual encoding is that, in the text encoder, layer normalization (LN) is located after the multi-head self-attention (MSA) and feed-forward network (FFN) layers. Similarly, the output of the $l$-th MSA block is denoted as $\bar{X}^t_l$ and the output of the $l$-th MLP block is denoted as $X^t_l$. We denote the number of MSA and MLP blocks in the text encoder as $L_t$.
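One form of the two forward passes that is consistent with the description above can be written as follows. It is a sketch that assumes the standard pre-norm ViT block for the visual encoder and the standard post-norm BERT block for the text encoder; summing the input embeddings, and the bar notation for the intermediate MSA outputs, are likewise assumptions rather than the paper's exact equations.

```latex
\begin{aligned}
% Visual encoder: LN is applied before MSA and MLP (pre-norm), l = 1, ..., L_v
X^v_0 &= X^v_{pat} + X^v_{pos} \\
\bar{X}^v_l &= \mathrm{MSA}\big(\mathrm{LN}(X^v_{l-1})\big) + X^v_{l-1} \\
X^v_l &= \mathrm{MLP}\big(\mathrm{LN}(\bar{X}^v_l)\big) + \bar{X}^v_l \\[4pt]
% Text encoder: LN is applied after MSA and MLP (post-norm), l = 1, ..., L_t
X^t_0 &= X^t_{word} + X^t_{pos} + X^t_{sen} \\
\bar{X}^t_l &= \mathrm{LN}\big(\mathrm{MSA}(X^t_{l-1}) + X^t_{l-1}\big) \\
X^t_l &= \mathrm{LN}\big(\mathrm{MLP}(\bar{X}^t_l) + \bar{X}^t_l\big)
\end{aligned}
```

The only structural difference between the two stacks in this sketch is where LN sits relative to the residual connection, which matches the distinction drawn in the text.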

Multimodal feature fusion
In the multimodal fusion module, we fuse the separately encoded text and image information. Relationships belong to a separate data category with certain label information; although they are usually described using text, their semantic relevance to the text and image descriptions of entities is relatively low. Therefore, we fuse and filter the image and text information separately from the relationships, and introduce the encoded relationship attributes later, when learning the path features.
To enhance the efficiency of the semantic interaction between the two different modalities of image and text, we adopt an intermediate representation to unify the multimodal information.On one hand, we aim to achieve a more fine-grained interaction between different modal feature information; on the other hand, since images often contain semantically irrelevant information, directly using the complete image embedding in the feature fusion process may introduce noise.Therefore, we feed the learned image and text vectors into a multimodal gated unit for weight learning to achieve the intermediate feature representation.
In the gated unit, $\sigma$ represents the sigmoid function, $X^v$ and $X^t$ denote the feature vectors output by the image and text encoders, respectively, $W_v$ and $W_t$ are parameter matrices, $g_f$ is a scalar within the range $[0, 1]$, $\tilde{X}_m$ represents the multi-modal embedding vector obtained through the filtering layer, and $\odot$ denotes element-wise multiplication (i.e., the Hadamard product).
Next, we feed the embeddings $\tilde{X}_m$ into the multi-modal encoder to further learn the semantic features:
$$X = \mathrm{Tran}(\tilde{X}_m) \quad (4.9)$$
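As an illustration of how such a gate can filter and fuse two modalities, the sketch below assumes the common form $g_f = \sigma(W_v X^v + W_t X^t)$ and $\tilde{X}_m = g_f \odot X^v + (1 - g_f) \odot X^t$. This specific form, the use of weight vectors in place of full parameter matrices, and the function names are assumptions made for the example; they are not confirmed as the model's exact gating equations.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_fusion(
    x_v: List[float], x_t: List[float], w_v: List[float], w_t: List[float]
) -> Tuple[float, List[float]]:
    """Fuse image and text vectors with a scalar gate (assumed form)."""
    if not (len(x_v) == len(x_t) == len(w_v) == len(w_t)):
        raise ValueError("all vectors must share one dimension")
    # Scalar gate g_f in [0, 1], computed from both modalities.
    logit = sum(w * x for w, x in zip(w_v, x_v)) + sum(w * x for w, x in zip(w_t, x_t))
    g_f = sigmoid(logit)
    # The gate scales the image features; its complement scales the text features.
    x_m = [g_f * v + (1.0 - g_f) * t for v, t in zip(x_v, x_t)]
    return g_f, x_m


g, fused = gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0])
print(g, fused)
```

With all-zero weights the gate is exactly 0.5 and the two modalities are averaged; learned weights would push the gate toward 0 when the image is judged irrelevant, which is the noise-filtering behavior the module is designed for.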

Prediction block
The previous structure yields the multi-modal feature embedding of a given fact description, but this is insufficient for large-scale and complex KGs. Hence, we further learn path features to better accomplish the task of KGC. The overall approach to learning structural features and performing completion can be summarized as follows. First, we extract a path existing in the MKG, connect the relations in the path, and then divide the path into several shorter components through a sliding window. Then, we select one of the components and use a recurrent attention unit to embed it, obtaining a relation vector that is represented as a weighted combination of existing relations. We recursively merge the divided components of the path, and finally use a scoring function to determine the truthfulness of unknown triplets. The overall process of the prediction block is shown in Algorithm 1.
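The segmentation and weighted merging steps summarized above can be sketched as follows. To keep the example dependency-free, a simple mean over relation embeddings stands in for the recurrent encoder and a single weight vector stands in for the learned scoring layer; both substitutions, along with the relation names, are assumptions made purely for illustration.

```python
import math
from typing import Dict, List, Sequence


def sliding_windows(path: Sequence[str], width: int) -> List[List[str]]:
    """Return the n + 1 - width contiguous relation segments of a path."""
    if width < 1 or width > len(path):
        return []
    return [list(path[i:i + width]) for i in range(len(path) - width + 1)]


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_encoder(segment: List[str], emb: Dict[str, List[float]]) -> List[float]:
    """Stand-in for the recurrent encoder: average the relation embeddings."""
    dim = len(emb[segment[0]])
    return [sum(emb[r][d] for r in segment) / len(segment) for d in range(dim)]


def path_feature(
    path: Sequence[str], emb: Dict[str, List[float]], width: int, weights: List[float]
) -> List[float]:
    """Softmax-weighted sum of the encoded windows of one path."""
    segments = sliding_windows(path, width)
    if not segments:
        raise ValueError("window width must be between 1 and the path length")
    encoded = [mean_encoder(s, emb) for s in segments]
    # One scalar score per window; a single linear layer stands in for FC.
    scores = [sum(w * x for w, x in zip(weights, y)) for y in encoded]
    mu = softmax(scores)
    return [sum(m * y[d] for m, y in zip(mu, encoded)) for d in range(len(weights))]


# Hypothetical three-hop relation path and 2-d relation embeddings.
path = ["born_in", "city_of", "part_of"]
emb = {"born_in": [1.0, 0.0], "city_of": [0.0, 1.0], "part_of": [1.0, 1.0]}
for w in (1, 2, 3):
    print(w, sliding_windows(path, w))
print(path_feature(path, emb, 2, [0.0, 0.0]))
```

For a path of $n$ relations, a window of width $w$ yields $n + 1 - w$ segments, so wider windows capture longer relation combinations at the cost of fewer segments per path.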

Algorithm 1 Prediction block
Input: the path body $r^p$
Output: the score of triplet $f(h, r, t)$
1: Initialize the window size $w$
2: for all $i = 1, 2, \dots, n - 1$ do
3:   get path segments with $w \in \{1, 2, 3\}$ and encode them with the LSTM: $\hat{y}_i, \hat{y}_{i+1} = \mathrm{LSTM}(w_i)$;
4: …

Sliding Window Segmentation. To extract fine-grained features from sampled paths, we decompose the sampled paths into combinations of different sizes using sliding windows of varying lengths. In the implementation, we use windows of size $w \in \{1, 2, 3\}$. Given the window size, the generated sliding windows traverse the path body $r^p = (r^p_1, \dots, r^p_n)$. Then, we use a long short-term memory (LSTM) network as a sequence encoder to encode the information within the sliding windows. Taking the sliding window of length 2 as an example,

$$\hat{y}_i, \hat{y}_{i+1} = \mathrm{LSTM}(w_i) \quad (4.10)$$

Since the final state $\hat{y}_{i+1}$ usually contains the complete information of the sequence, we select $y_i = \hat{y}_{i+1}$. Learning $y_i$ is meaningful if the relationship segments in the $i$-th sliding window always appear together in some combination, since such a combination is more likely to represent a real "long-distance" relationship. To incorporate this observation into our model, we calculate the probability values of these relationship segments by

$$\mu = \mathrm{softmax}\big([\mathrm{FC}(y_1), \mathrm{FC}(y_2), \dots, \mathrm{FC}(y_{n+1-w})]\big) \quad (4.11)$$

where $\mathrm{FC}(\cdot)$ represents a fully connected layer, which is used to learn the probability that the $i$-th window, encoded as $y_i$, represents a meaningful relationship fragment. Finally, we calculate the weighted sum of information from the different windows to represent the complete path features.

Scoring Function. Considering the excellent performance of graph convolutional models in handling KGC problems, we adopt a convolution-based scoring function. In this scoring function, $X_h$ and $X_t$ represent the multi-modal embeddings of the head and tail entities, respectively, $Y$ represents the embedding of their relationship, $*$ and $\omega$ denote the convolution operation and the convolution kernel, respectively, $\mathrm{vec}(\cdot)$ represents the projection operation from the feature map to the vector space, and $W$ is a parameter matrix. With the above method, we can compute whether a fact constructed by a certain relationship between two entities is true or not.
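For orientation, one form of the path aggregation and of a convolutional scoring function that is consistent with the symbols defined in this section is given below. It is a hedged sketch: the weighted sum follows directly from the description of $\mu$ and $y_i$, while the scoring function assumes a ConvE-style design in which $g$ is an unspecified nonlinearity and $[X_h; Y]$ denotes the stacked head and relation embeddings. Neither $g$ nor the stacking is confirmed by the text.

```latex
\begin{aligned}
% Path feature: softmax-weighted sum over the n + 1 - w windows
Y &= \sum_{i=1}^{n+1-w} \mu_i \, y_i \\
% ConvE-style score (assumed form), squashed to a probability
\psi(h, r, t) &= g\Big(\mathrm{vec}\big(g([X_h; Y] * \omega)\big)\, W\Big)\, X_t \\
f(h, r, t) &= \sigma\big(\psi(h, r, t)\big)
\end{aligned}
```

Under this reading, $f(h, r, t)$ lies in $(0, 1)$ and can be compared against the target values 1 and 0 for true and false triplets.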
For ease of reference, we summarize the main symbol notations used in this chapter in Table 4.

Dataset
We evaluate the effectiveness of the MLSFF model on two publicly available datasets: (i) FB15K-237-IMG, a subset of the large-scale knowledge graph Freebase that is commonly used in KGC tasks and in which each entity has 10 images; (ii) WN18-IMG, an extension of WN18, a knowledge graph extracted from WordNet, in which each entity has 10 images [52]. These two datasets can be obtained as FB15k-WN18-images. Table 5 shows the statistical information of the datasets.

Settings
Evaluation Metrics: We adopted classic knowledge graph completion evaluation metrics, including Hits@k and mean rank (MR), as shown in Table 6.

Table 6. Summarization of evaluation metrics (Hits@k and MR with their calculation formulas).

Hits@k: The Hits@k metric is defined as the proportion of true entities that appear in the top-k ranked list of entities. It is calculated as follows:

$$\mathrm{Hits@}k = \frac{1}{Q} \sum_{i=1}^{Q} \mathbb{1}(\mathrm{rank}_i \le k)$$

where $\mathrm{rank}_i$ represents the rank of the expected entity of the $i$-th incomplete fact triple, $\mathbb{1}(\cdot)$ is the indicator function, and $Q$ represents the total number of incomplete fact triples.

Mean Rank (MR): Mean rank is the arithmetic average of the individual entity ranks, defined as:

$$\mathrm{MR} = \frac{1}{Q} \sum_{i=1}^{Q} \mathrm{rank}_i$$

Parameter Configuration: Considering the model's scale and computational efficiency, we choose the ViT-B/16 pre-trained model for the image encoder. We set the embedding dimensions for both text and image to 768. The number of layers for both the image and text encoders is set to 12, while the number of layers for the modality encoder is set to 3. The graph embedding dimension is set to 200, and the batch size is set to 64. We use a warmup schedule and the Adam optimizer to adjust the learning rate of the model parameters. The initial learning rate is set to 0.0005, and the dropout rate is set to 0.1.
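The two evaluation metrics used in this section, Hits@k and MR, can be computed from a list of ranks as in the minimal sketch below. The function names and the example ranks are illustrative and are not results from the experiments.

```python
from typing import List


def hits_at_k(ranks: List[int], k: int) -> float:
    """Fraction of queries whose expected entity is ranked within the top k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_rank(ranks: List[int]) -> float:
    """Arithmetic mean of the ranks of the expected entities."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(ranks) / len(ranks)


# Made-up ranks of the expected entity for five incomplete triples (1 = best).
example_ranks = [1, 3, 2, 15, 1]
print(hits_at_k(example_ranks, 1))   # 0.4
print(hits_at_k(example_ranks, 3))   # 0.8
print(hits_at_k(example_ranks, 10))  # 0.8
print(mean_rank(example_ranks))      # 4.4
```

Higher Hits@k is better, whereas lower MR is better; MR is sensitive to a few badly ranked queries (the rank of 15 above), which is one reason the two metrics can disagree about which model is stronger.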
Baseline Setup: We selected four unimodal methods and four multi-modal methods as baselines to compare with our proposed model. The unimodal methods include the following: 1) TransE [26], a classic translation-based model that encodes entities and relationships into a linear space; 2) DistMult [53], which uses a linear neural network to encode a multi-relation graph for multi-relation learning; 3) ComplEx [54], which handles both symmetric and asymmetric relations by introducing complex-valued embeddings; and 4) RotatE [55], which defines relations as rotations from the head entity to the tail entity in a complex space to achieve multi-class reasoning. The multi-modal methods include the following: (i) IKRL (UNION) [44], which extends TransE to learn visual representations of entities and structural features of KGs; (ii) TransAE [56], which combines multi-modal encoders with TransE to achieve a unified representation of visual and textual features; (iii) RSME [57], which uses a forget gate to select valuable images for MKG embedding; and (iv) MKGformer [52], which proposes an MKG pre-training model based on a hybrid transformer structure.
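For readers unfamiliar with the four unimodal baselines, their scoring functions can be sketched on plain Python lists as follows. The choice of the L2 norm for TransE and of the summed per-dimension modulus for RotatE are common variants assumed here for illustration; the toy vectors are likewise made up.

```python
import cmath
import math
from typing import List


def transe_score(h: List[float], r: List[float], t: List[float]) -> float:
    """TransE: negative L2 distance between h + r and t."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def distmult_score(h: List[float], r: List[float], t: List[float]) -> float:
    """DistMult: trilinear product; symmetric in h and t."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def complex_score(h: List[complex], r: List[complex], t: List[complex]) -> float:
    """ComplEx: real part of <h, r, conj(t)>; not symmetric in general."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real


def rotate_score(h: List[complex], phases: List[float], t: List[complex]) -> float:
    """RotatE: each relation coordinate is a unit-modulus rotation e^{i*phase}."""
    return -sum(abs(hi * cmath.exp(1j * p) - ti) for hi, p, ti in zip(h, phases, t))


print(transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]))      # perfect translation
print(distmult_score([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]))
print(complex_score([1 + 1j], [1j], [1 - 1j]))
print(rotate_score([1 + 0j], [math.pi / 2], [1j]))           # near-perfect rotation
```

Swapping head and tail leaves the DistMult score unchanged but flips the sign of the ComplEx score in this example, which illustrates why complex-valued embeddings can model asymmetric relations.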

Main results
The experimental results on the two datasets are shown in Table 7, which shows that our model generally outperforms the 8 baseline methods.

Table 7. Experimental results on the two datasets. Columns: Model; FB15k-237-IMG (Hits@1↑, Hits@3↑, Hits@10↑, MR↓); WN18-IMG (Hits@1↑, Hits@3↑, Hits@10↑, MR↓).

Firstly, across all methods, the scores on FB15k-237-IMG are generally lower than those on WN18-IMG. The fundamental reason is that FB15k-237-IMG is sparser and more complex than WN18-IMG, with a greater variety of relationships between different entities. In addition, the improvement of our model is more pronounced on Hits@1 than on Hits@3 or Hits@10, indicating a superior discriminative ability in predicting unknown entities. In the MLSFF model, we use two single-modal encoders to extract image and text information, followed by a multi-modal layer for interaction, which enables full learning of the semantic information in entity descriptions. We introduce a sliding window when learning the link features, which realizes "scalable" path sampling and to some extent alleviates the problem of complex graph structures.
Secondly, some traditional single-modal methods, such as RotatE, even outperform architectures that use multi-modal features in overall performance. This suggests that well-designed relationship decomposition and learning rules are effective in solving complex graph problems, and that fully utilizing structural features can improve prediction accuracy. Therefore, after obtaining the multi-modal encoding information, our model not only uses the traditional graph convolutional method to obtain neighbor node information, but also incorporates long-distance path features and borrows from the recurrent neural network structures used in text processing to extract left and right node information from selected paths. By adding certain "vertical" features during the convolution process, our prediction model can have better interpretability.
Finally, our model achieved significant improvements of 4.8 and 1.2% on the two datasets, respectively. However, on the FB15k-237-IMG dataset, our model's MR results were slightly inferior to those of the RotatE model. This could be attributed to the FB15k-237-IMG dataset containing a larger number of entities and a more diverse set of relationships, resulting in a sparser and more complex knowledge graph. While our model has improved its ability to learn multi-hop path relationships to some extent, it lacks operations on negative samples similar to those in the RotatE model, which has affected the overall accuracy. Overall, the experimental results demonstrate that our model outperforms existing methods on most evaluation metrics, with even more significant improvements observed on more complex knowledge graphs. This is because the MLSFF model learns more comprehensive semantic features by fusing information from both the image and text modalities, enabling more comprehensive knowledge extraction from the graph. In addition, we employed convolutional operations that capture neighborhood information and an LSTM structure that learns path-level features to achieve a more comprehensive and three-dimensional feature encoding structure for learning graph structural features, which is highly effective for processing large-scale knowledge graphs.

Ablation study
To investigate the actual effects of each component in the MLSFF model, we conducted ablation studies by removing some of the components.
w/o SinE: To investigate the effect of the single-modal encoders on understanding image and text semantics, we aligned the one-dimensional vectorized image patches and text embeddings, calculated their Hadamard product, and directly fed the result into the multi-modal encoder for learning.
w/o Flt: To investigate the actual effect of the irrelevance filtering layer, we evaluated the multi-modal fusion module with the filtering layer removed, directly fusing the encoded image and text features.
w/o Swin: To demonstrate the positive effect of extracting path information on learning graph structure features, we removed the sliding window encoding module and only used graph convolution operations to obtain structural embeddings.
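The three variants above can be thought of as switches on the full model. The sketch below is only one way to organize such an ablation and is not taken from the authors' code: the class name `AblationConfig`, its field names, and the helper `hadamard` are all hypothetical. The helper shows the element-wise (Hadamard) product that the w/o SinE variant applies to aligned image and text vectors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical switches; each False value corresponds to one ablation variant."""
    use_single_modal_encoders: bool = True  # False -> "w/o SinE"
    use_filter_layer: bool = True           # False -> "w/o Flt"
    use_sliding_window: bool = True         # False -> "w/o Swin"


VARIANTS = {
    "full": AblationConfig(),
    "w/o SinE": AblationConfig(use_single_modal_encoders=False),
    "w/o Flt": AblationConfig(use_filter_layer=False),
    "w/o Swin": AblationConfig(use_sliding_window=False),
}


def hadamard(image_vec, text_vec):
    """Element-wise product of two aligned vectors of equal length."""
    if len(image_vec) != len(text_vec):
        raise ValueError("vectors must be aligned to the same length")
    return [x * y for x, y in zip(image_vec, text_vec)]


# Toy aligned vectors (illustrative values only).
fused = hadamard([1.0, 2.0, 0.5], [3.0, 4.0, 2.0])
```

Each variant disables exactly one component, so any change in the metrics can be attributed to that component alone.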
Figure 4 shows that using single-modality encoders to extract image and text features effectively enhances semantic understanding and helps the model learn human knowledge, thereby improving performance in KGC tasks. Although image features can assist in text understanding, they still introduce some noise. Filtering out irrelevant information further improves the fusion of multi-modal features and raises accuracy. In addition, when facing large-scale and complex knowledge graphs, graph convolutional operations can already learn structural information and capture neighbor features, but introducing path and rule features further improves model interpretability and prediction ability. Specifically, when dealing with sparse graphs, simple convolutional operations may lead to a certain decrease in accuracy, and learning path features can also help improve model efficiency.

Hyperparameter analysis
Our connection prediction module is mainly implemented based on the GNN algorithm, which aggregates neighbor information into the target node and then updates the target node based on the integrated information. However, this approach is prone to over-smoothing, where the representations of different nodes become increasingly similar as the number of GNN layers grows. To address this issue, we introduce "longer-distance" path embeddings, which combine deep features and breadth features to extract complex graph structure information. We further explore effective graph processing structures by adjusting the number of convolutional layers and the size of the sliding window. In this work, considering memory and computational capacity, we conduct experiments with sliding window widths ranging from 1 to 3. As shown in Figure 5, the model performs best when the sliding window width is set to 2.
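The over-smoothing effect described above can be reproduced on a toy graph. The sketch below is illustrative only and is not the MLSFF propagation rule: it assumes plain mean aggregation over a node and its neighbors, and the four-node graph, the feature values, and the helper names `mean_aggregate` and `spread` are all hypothetical.

```python
def mean_aggregate(features, neighbors):
    """One propagation step: each node becomes the mean of itself and its neighbors."""
    updated = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in neighbors[node]]
        updated[node] = [sum(col) / len(group) for col in zip(*group)]
    return updated


def spread(features):
    """Largest per-dimension gap between any two node vectors."""
    return max(max(col) - min(col) for col in zip(*features.values()))


# Hypothetical path graph a - b - c - d with 2-dimensional node features.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [0.0, 0.0]}

# Track how far apart the node representations are after each layer.
spreads = [spread(features)]
for _ in range(10):
    features = mean_aggregate(features, neighbors)
    spreads.append(spread(features))
```

The recorded spread never grows from one layer to the next, so after enough layers the nodes become hard to tell apart. This is the behavior that motivates adding path-level features rather than simply stacking more convolutional layers.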
When the sliding window width is set to 2, our model can learn more layers of graph structural features and neighbor information. When the sliding window width is too small, that is, when too few subgraphs are learned, the information in the knowledge graph cannot be fully aggregated to learn its structural information, and some useful high-order neighbors cannot be captured. When the number of subgraphs is too large, the node representations are overly smoothed because of excessive noise.

RED-GNN: As a GNN model for the traditional knowledge graph completion task, the RED-GNN model has an algorithmic complexity of $O(d \cdot \min(D^L, |F| L))$, where $D$ represents the average degree of the r-directed graph per layer. It can be observed that our model has a slightly higher computational complexity. This is attributed to two main factors: first, the inherent complexity of multimodal knowledge graphs; and second, the decision to incorporate a more extensive graph feature learning scheme to enhance the interpretability of paths.

Study limitation
Despite the promising results and contributions of our study, some limitations should be acknowledged. While our model aims to enhance interpretability by incorporating graph features and multi-hop paths, the interpretability of the model's predictions may still be limited. Explaining the reasoning behind specific predictions or understanding the underlying decision-making processes can be challenging, especially in complex multimodal knowledge graphs.
In addition, the proposed model exhibits high complexity, which results in increased demands for computational resources and significant time consumption. Furthermore, our model does not consider negative samples during the sampling process, which affects the overall accuracy of the prediction task.

Conclusions
We propose an MLSFF model that first uses two independent single-modality encoders to obtain pre-trained embeddings for both image and text information. Then, after filtering out irrelevant information, the multi-modal features are fused to obtain a unified encoding vector. We utilize graph convolutional algorithms to learn the structural information in the knowledge graph. In addition, we introduce path-based feature information into the graph structural features to obtain richer relationship representations. Our experimental results demonstrate that our model achieves better performance in MKGC tasks. To address the issues of high complexity and the omission of negative samples in our model, we will focus on the following areas for future research: (i) designing simpler and more computationally efficient scoring functions; (ii) accounting for negative sample interference, thereby mitigating its impact on the accuracy of the prediction task; (iii) incorporating additional modalities, such as numerical features, to achieve a more comprehensive and diverse multimodal fusion and enhance the overall performance of the model.

Figure 1. A simple multi-modal knowledge graph example.

Figure 2. Overview of our model structure.

Figure 4. Ablation on different components of the MLSFF.

Complexity analysis
MLSFF: Denote the entity embedding dimension as $d_e$, the structural embedding dimension as $d_r$, and the number of channels as $T$. The final output dimension for triplet encoding is denoted as $m \times n$. The main complexity of our model can be represented as $O(|E| d_e + |R| d_r + Tmn + Td(2d_m - m + 1)(d_n - n + 1))$.
TransE: The scoring function of the TransE model is denoted as $\|h + r - t\|$, and as a result, its algorithmic complexity can be represented as $O(|E| d + |R| d)$.
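To get a feel for how the two complexity expressions above compare, their terms can be evaluated for concrete sizes. The sketch below simply plugs numbers into the stated expressions. All values are hypothetical round numbers, not the dataset statistics or the paper's hyperparameters. The symbols d, d_m, and d_n are not defined in this section, so they are treated here as plain positive integers.

```python
def mlsff_cost(num_entities, num_relations, d_e, d_r, T, m, n, d, d_m, d_n):
    """Evaluate |E| d_e + |R| d_r + T m n + T d (2 d_m - m + 1)(d_n - n + 1)."""
    embedding_terms = num_entities * d_e + num_relations * d_r
    encoding_terms = T * m * n + T * d * (2 * d_m - m + 1) * (d_n - n + 1)
    return embedding_terms + encoding_terms


def transe_cost(num_entities, num_relations, d):
    """Evaluate |E| d + |R| d."""
    return num_entities * d + num_relations * d


# Hypothetical sizes chosen for illustration only.
sizes = dict(num_entities=10_000, num_relations=200)

mlsff = mlsff_cost(**sizes, d_e=100, d_r=100, T=8, m=3, n=3, d=100, d_m=10, d_n=10)
transe = transe_cost(**sizes, d=100)
extra = mlsff - transe  # contribution of the triplet-encoding terms
```

With these assumed sizes the embedding terms dominate both counts, and the extra cost of MLSFF comes entirely from the two triplet-encoding terms, which do not depend on the number of entities or relations.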

Table 1. Summary of existing KGC models.

Table 2. Model performance comparison.

Table 5. Statistics of datasets.