ShallowBKGC: a BERT-enhanced shallow neural network model for knowledge graph completion

Knowledge graph completion aims to predict missing relations between entities in a knowledge graph. One effective approach to knowledge graph completion is knowledge graph embedding. However, existing embedding methods usually focus on developing deeper and more complex neural networks, or on leveraging additional information, which inevitably increases computational complexity and is unfriendly to real-time applications. In this article, we propose an effective BERT-enhanced shallow neural network model for knowledge graph completion named ShallowBKGC. Specifically, given an entity pair, we first apply the pre-trained language model BERT to extract the text features of the head and tail entities. At the same time, we use an embedding layer to extract the structure features of the head and tail entities. Then the text and structure features are integrated into one entity-pair representation via an average operation followed by a non-linear transformation. Finally, based on the entity-pair representation, we calculate the probability of each relation through multi-label modeling to predict relations for the given entity pair. Experimental results on three benchmark datasets show that our model achieves superior performance in comparison with baseline methods. The source code of this article can be obtained from https://github.com/Joni-gogogo/ShallowBKGC.


INTRODUCTION
Knowledge graphs (KGs) such as DBpedia (Auer et al., 2007), Freebase (Bollacker et al., 2008), NELL (Carlson et al., 2010), and Wikidata (Vrandečič & Krötzsch, 2014) are important resources for many artificial intelligence tasks, including semantic search (Feddoul, 2020), recommendations (Wu et al., 2022) and question answering (Li & Moens, 2022). These KGs are composed of factual triplets, with each triplet $(h, r, t)$ denoting the fact that relation $r$ exists between head entity $h$ and tail entity $t$. KGs can also be formalized as directed multi-relational graphs, where nodes correspond to entities and (labeled) edges represent the types of relationships among entities.
Although existing KGs usually contain billions of factual triplets, they still suffer from an incompleteness problem, i.e., they miss a large number of valid triplets (Nguyen, 2020). In particular, in English DBpedia 2014, 60% of person entities miss place-of-birth information, and 58% of the scientists have no facts about what they are known for.

The main contributions of this article are summarized as follows:

- We introduce the pre-trained language model BERT in a feature extraction manner to obtain text features of entities, thereby further improving the performance of KGC without retraining the pre-trained model.
- We conduct experiments on three benchmark datasets, and the experimental results demonstrate that our model achieves superior performance in comparison with baseline methods.

RELATED WORK
Existing KGC methods can be roughly classified into four categories: translation-based models, tensor decomposition-based models, neural network-based models, and pre-trained language/large language-based models.

Translation-based models
Translation-based models consider the relation between a head entity and a tail entity as a translation operation in the vector space, and calculate the distance between the translated head entity vector and the tail entity vector to measure the plausibility of a triplet. Bordes et al. (2013) present the initial translation-based model TransE, which learns low-dimensional and dense vectors for every entity and relation, so that relations correspond to translation vectors operating on entity vectors. Wang et al. (2014) present TransH, which extends TransE by modeling a relation as a translation operation on a relation-specific hyperplane.

Tensor decomposition-based models
Tensor decomposition-based models use the triangular norm to measure the plausibility of triplets. Yang et al. (2015) present DistMult, which treats triplets as a tensor decomposition and constrains all relation embeddings to be diagonal matrices. ComplEx (Trouillon et al., 2016) extends DistMult to the complex space to better model asymmetric and inverse relations. Balažević, Allen & Hospedales (2019) present TuckER, which performs KGC based on the Tucker decomposition of the binary tensor of known triplets. Inspired by the Tucker decomposition of order-4 tensors, Shao et al. (2022) present a tensor decomposition model for temporal KGC. Zhang et al. (2024) extend tensor decomposition methods to temporal KGC.
Models of this category are proficient in capturing complex interactions between entities and relations in KGs. However, as the scale of KGs grows, the computational complexity of these models may escalate rapidly.

Neural network-based models
Various neural networks have been widely explored for KGC and have achieved promising performance. Dettmers et al. (2018) present a multi-layer convolutional model, ConvE, which explores convolutional neural networks for KGC and uses 2D convolution over embeddings to predict missing triplets in a KG. Shang et al. (2019) present an end-to-end graph structure-aware convolutional network model, SACN, which combines a graph convolutional network (GCN) and ConvE for KGC. Dai Quoc Nguyen, Nguyen & Phung (2018) present ConvKB, which utilizes a convolutional neural network to capture the global relationships among the dimensional entries of entity and relation embeddings. CapsE (Nguyen et al., 2019) combines ConvKB with a capsule network for both KGC and search personalization tasks. Schlichtkrull et al. (2018) present relational graph convolutional networks and apply them to KGC. Vashishth et al. (2020) present CompGCN, which leverages a variety of composition operations from knowledge graph embedding techniques to jointly embed both entities and relations in a graph. SHALLOM (Demir, Moussallem & Ngomo, 2021) and the prior version of the model proposed in this article, ASLEEP (Jia, 2022), apply shallow neural networks to KGC and achieve good performance while maintaining high efficiency.
Models of this category have significant advantages in semantic feature learning, and our proposed model belongs to this category. The main difference between them and our model is that most of them rely on deeper and more complex neural networks, while our model employs a shallow neural network, which is not computationally demanding and is friendly to real-time applications. Although there are several models that use shallow neural networks for KGC, these models only use structural information and ignore the rich information contained in text. Under the premise of keeping the model as simple as possible, we consider both text and structural information for KGC.
Pre-trained language/large language-based models

Pre-trained language models and large language models have received widespread attention in many natural language processing tasks, including KGC. Yao, Mao & Luo (2019a) explore the pre-trained language model BERT for KGC. StAR (Wang et al., 2021) extends KG-BERT by taking structural information into account for KGC. Yao et al. (2023) present KG-LLM, which investigates large language models, including ChatGLM (Du et al., 2022) and LLaMA (Touvron et al., 2023), for KGC. Yang, Fang & Zhou (2023) present a constrained-prompt KGC method based on a large language model. Zhang et al. (2023) present KoPA, which integrates pre-trained KG structural features with a large language model for KGC.
Models of this category achieve great success in KGC. However, they usually require diverse fine-tuning strategies and mostly cost much time in training and inference. It should be pointed out that our proposed model also uses the pre-trained model BERT. The difference from existing models is that, in order to keep the model as simple as possible, we use BERT in a feature extraction manner, that is, the parameters of BERT are not involved in training.

OUR PROPOSED MODEL

Problem formulation
A KG is a type of multi-relational directed graph that typically consists of a collection of triplets of the form $(h, r, t)$. It can be formally defined as $G = (E, R, T)$, where $T$ represents the set of all triplets, and $E$ and $R$ represent the sets of all entities and relations, respectively.
The objective of the KGC task is to predict missing relations in $G$ based on the known triplets $T$. In other words, the aim of KGC is to develop a model that accepts a query consisting of a head entity and a tail entity, $(h_i, ?, t_i)$, and ranks all candidate relations $r_c \in R$ to resolve the query (Lovelace & Rose, 2022). An effective KGC model should enable correct candidates to have higher rankings than incorrect candidates.

Model overview
Our proposed model ShallowBKGC takes as input an entity pair, and outputs the probability that each relation exists between the two entities. As illustrated in Fig. 1, our model consists of three key steps: (1) entity feature extraction, (2) entity-pair representation, and (3) multi-label relation modeling. The detailed calculation process of each step is as follows.

Entity feature extraction
Given an entity pair $(h, t)$, our model extracts the features of the head and tail entities by taking into account both text and structural information.
For text information, we apply the pre-trained language model BERT (Kenton & Toutanova, 2019), which has achieved great success in multiple natural language processing tasks, to extract the text feature of the given entity. Figure 2 illustrates the framework of BERT for entity text feature extraction. Formally, given the text information of the head and tail entities, i.e., the head entity name $h_{text} = \{w^h_1, w^h_2, \ldots, w^h_N\}$ and the tail entity name $t_{text} = \{w^t_1, w^t_2, \ldots, w^t_N\}$ (since many entities lack descriptive information and introducing additional information would increase computational complexity, we only use the name information that every entity has as text information), we first add a special classification token (CLS) and a separator token (SEP) at the beginning and end of the entity name, respectively, to obtain the marked entity name. Then, through the tokenizer, we obtain the representation of the marked entity name. Finally, we feed the representation into BERT to get the text features of the head and tail entities as follows:

$$h_t = \mathrm{BERT}(h_{text}), \qquad t_t = \mathrm{BERT}(t_{text})$$

where the output in each case is $C \in \mathbb{R}^H$, the hidden vector of the special token (CLS), which contains the features of the entire input text. Therefore, we use it as the text feature of the given entity.
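To make this step concrete, the following sketch extracts (CLS) text features with a frozen BERT. It assumes the HuggingFace transformers library and the bert-base-uncased checkpoint, neither of which is specified in this article; the function name and example entities are illustrative.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()  # feature-extraction mode: BERT's parameters are never trained

@torch.no_grad()  # no gradients flow into BERT, matching the frozen setup
def text_feature(entity_name: str) -> torch.Tensor:
    # The tokenizer adds the [CLS] and [SEP] tokens around the entity name.
    inputs = tokenizer(entity_name, return_tensors="pt")
    outputs = bert(**inputs)
    # Hidden vector C of the [CLS] token; shape [1, H], H = 768 for BERT-base
    return outputs.last_hidden_state[:, 0, :]

h_t = text_feature("Barack Obama")   # head entity text feature (name illustrative)
t_t = text_feature("United States")  # tail entity text feature
```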
For structural information, our model receives the IDs of the given head and tail entities, and extracts their structure features through an embedding layer as follows:

$$h_s = \mathrm{Emb}(h_{id}), \qquad t_s = \mathrm{Emb}(t_{id})$$

where $h_s \in \mathbb{R}^d$ and $t_s \in \mathbb{R}^d$ are the embeddings of structural information corresponding to the head and tail entities, respectively. We then integrate the entity text feature and the entity structure feature through an average operation, and get the entity features as

$$h_e = \mathrm{ave}(h_t[{:}d],\, h_s), \qquad t_e = \mathrm{ave}(t_t[{:}d],\, t_s)$$

For the sake of computational convenience, and considering the consistency of tensor shapes, we take the first $d$ columns of the text features when fusing the text and structure features of entities.
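A minimal sketch of the structure-feature lookup and the text/structure fusion follows; the entity count, the dimension d, and all identifiers are illustrative assumptions.

```python
import torch
import torch.nn as nn

num_entities, d = 40943, 256                 # illustrative (e.g., WN18RR has 40,943 entities); d assumed
entity_emb = nn.Embedding(num_entities, d)   # embedding layer for structure features

def entity_feature(text_feat: torch.Tensor, entity_id: int) -> torch.Tensor:
    s = entity_emb(torch.tensor([entity_id]))  # structure feature h_s (or t_s), shape [1, d]
    t = text_feat[:, :d]                       # first d columns of the BERT text feature
    return (t + s) / 2                         # average fusion: ave(h_t[:d], h_s)

h_t = torch.randn(1, 768)                  # stand-in for a BERT [CLS] feature (H = 768)
h_e = entity_feature(h_t, entity_id=123)   # fused head entity feature, shape [1, d]
```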

Entity-pair representation
After getting the features of the entities, our model integrates the head entity feature and the tail entity feature into an entity-pair representation through an average operation and a non-linear transformation as follows:

$$x = f\!\left(U \cdot \mathrm{ave}(h_e, t_e)\right)$$

where $U \in \mathbb{R}^{k \times k}$ is the transformation matrix, $\mathrm{ave}(\cdot)$ denotes the average function, and $f(\cdot)$ is a non-linear activation function.
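As a sketch, the pair representation can be computed as below; the choice of tanh for the non-linear transformation is an assumption (the article does not name the activation), as are the dimension k and all identifiers.

```python
import torch
import torch.nn as nn

k = 256                               # feature dimension (assumed)
U = nn.Parameter(torch.empty(k, k))   # transformation matrix U in R^{k x k}
nn.init.xavier_uniform_(U)

def pair_representation(h_e: torch.Tensor, t_e: torch.Tensor) -> torch.Tensor:
    avg = (h_e + t_e) / 2             # ave(h_e, t_e): average of head/tail features
    return torch.tanh(avg @ U)        # non-linear transformation (tanh is assumed)

x = pair_representation(torch.randn(1, k), torch.randn(1, k))  # shape [1, k]
```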

Multi-label relation modeling
Since there may exist multiple relations between an entity pair, we model KGC as a multi-label learning problem. Based on the entity-pair representation obtained in the previous subsection, our model calculates the confidence score of each relation as follows:

$$S = V x$$

where $V \in \mathbb{R}^{|R| \times k}$ is the collection of weight vectors for each relation. Afterwards, the sigmoid function is applied to each element of the score vector $S$ to compute the probability that each relation exists:

$$p = \sigma(S)$$

where $|R|$ denotes the number of relations.
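A sketch of the scoring and probability computation is given below, with illustrative sizes (|R| = 11 matches WN18RR, but any relation count works):

```python
import torch
import torch.nn as nn

num_relations, k = 11, 256                   # illustrative sizes
V = nn.Linear(k, num_relations, bias=False)  # rows of V: one weight vector per relation

x = torch.randn(1, k)                        # entity-pair representation from the previous step
S = V(x)                                     # confidence scores S = Vx, shape [1, |R|]
p = torch.sigmoid(S)                         # independent existence probability per relation
```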

Model training
We define the loss function using cross-entropy as follows:

$$\mathcal{L} = -\sum_{i=1}^{|R|} \left( y_i \log p_i + (1 - y_i) \log(1 - p_i) \right)$$

where $y_i \in \{0, 1\}$ is the true label for relation $i$, and $p_i$ is the predicted probability for relation $i$. The loss function is optimized with Adam (Kingma & Ba, 2015), and dropout (Srivastava et al., 2014) is employed for regularization.
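The training step below is a minimal sketch under assumed hyperparameters (learning rate, dropout rate, batch size and dimensions are all illustrative); it combines the scoring layer, dropout, cross-entropy and Adam as described above.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    # Minimal stand-in for the relation-scoring part of the model.
    def __init__(self, k: int = 256, num_relations: int = 11, p_drop: float = 0.2):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)              # regularization (rate assumed)
        self.V = nn.Linear(k, num_relations, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.V(self.dropout(x)))  # per-relation probabilities

model = RelationHead()
criterion = nn.BCELoss()                                   # multi-label cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is assumed

x = torch.randn(4, 256)                    # batch of entity-pair representations
y = torch.randint(0, 2, (4, 11)).float()   # multi-hot relation labels, y_i in {0, 1}

optimizer.zero_grad()
loss = criterion(model(x), y)              # cross-entropy between p and y
loss.backward()
optimizer.step()
```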

Evaluation metrics
We use mean rank (MR), mean reciprocal rank (MRR) and Hits@N as evaluation metrics, in which MR is the average rank of all test triplets, MRR is the average of the reciprocal ranks, and Hits@N is the percentage of test triplets that are ranked within the top N. They are formally defined as follows:

$$\mathrm{MR} = \frac{1}{|\mathrm{Triplet}_{test}|} \sum_{i=1}^{|\mathrm{Triplet}_{test}|} \mathrm{rank}_{triplet(i)}, \qquad \mathrm{MRR} = \frac{1}{|\mathrm{Triplet}_{test}|} \sum_{i=1}^{|\mathrm{Triplet}_{test}|} \frac{1}{\mathrm{rank}_{triplet(i)}}$$

$$\mathrm{Hits@}N = \frac{\left| \{\, triplet(i) \in \mathrm{Triplet}_{test} : \mathrm{rank}_{triplet(i)} \le N \,\} \right|}{|\mathrm{Triplet}_{test}|}$$

where $|\mathrm{Triplet}_{test}|$ is the number of test triplets, and $triplet(i)$ is the $i$-th test triplet.
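All three metrics reduce to simple statistics over the rank of the true relation for each test triplet, as in this illustrative helper:

```python
def mr_mrr_hits(ranks, n=1):
    # Compute MR, MRR and Hits@N from the ranks of the correct relations.
    total = len(ranks)
    mr = sum(ranks) / total                          # mean rank
    mrr = sum(1.0 / r for r in ranks) / total        # mean reciprocal rank
    hits = sum(1 for r in ranks if r <= n) / total   # Hits@N
    return mr, mrr, hits

# Example: ranks of the true relation for five hypothetical test triplets
print(mr_mrr_hits([1, 3, 2, 1, 7], n=3))             # -> (2.8, 0.595..., 0.8)
```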
Additionally, to evaluate model efficiency, we measure the running time (RT) of the training phase: we record the average time of three epochs of the model on each dataset, in seconds. Our experimental platform is ModelArts, and the specific configuration selected is pytorch1.8-cuda10.2-cudnn7-ubuntu18.04 with a P100 GPU (16 GB).
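A wall-clock measurement of this kind can be as simple as the following sketch (the epoch loop body is a placeholder):

```python
import time

epoch_times = []
for epoch in range(3):
    start = time.perf_counter()
    # ... run one training epoch here ...
    epoch_times.append(time.perf_counter() - start)

print(sum(epoch_times) / len(epoch_times))  # average RT over three epochs, in seconds
```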

Baseline methods
We compare our model against the following state-of-the-art KGC models: the translation-based models TransE and RotatE, the tensor decomposition-based models DistMult and ComplEx, the neural network-based models ConvE, SHALLOM and ASLEEP, and the pre-trained language/large language-based models KG-BERT, KG-ChatGLM-6B, KG-LLaMA-7B and KG-LLaMA-13B. Below we briefly introduce these models.
- TransE (Bordes et al., 2013) is the initial translation-based model, which views relations as translations from head entities to tail entities in a low-dimensional space.
- RotatE (Sun et al., 2019) is an efficient translation-based model that represents entities as complex vectors and relations as rotations.
- ConvE (Dettmers et al., 2018) is a deep neural network-based model that applies a convolutional neural network for KGC.
- ASLEEP (Jia, 2022) improves the way SHALLOM obtains the entity-pair representation, and is the prior version of our proposed model.

Experimental results
In this section, we compare the performance of our model ShallowBKGC with that of the baseline methods on the widely used relation prediction task. The task of relation prediction is to complete a triplet $(h, r, t)$ with $r$ missing, i.e., to predict the missing $r$ given $(h, t)$. From the relation prediction results shown in Tables 3-5, we summarize our key observations as follows.
(1) The shallow neural network-based models, i.e., our model ShallowBKGC and the baselines ASLEEP and SHALLOM, outperform the translation-based model TransE, the complex vector-based models ComplEx and RotatE, the deep neural network-based model ConvE, the pre-trained language-based model KG-BERT, and even the large language-based model KG-ChatGLM-6B, demonstrating the effectiveness of shallow neural networks for KGC. For example, compared to KG-BERT and KG-ChatGLM-6B, our model ShallowBKGC achieves 0.2% and 11.7% absolute improvements in Hits@1 on YAGO3-10, respectively. It is worth mentioning that the result of KG-ChatGLM-6B is lower than that of KG-BERT. This suggests that it is not the case that the more layers a model has, or the newer the technology is, the better the results will be.
(2) Comparing our model ShallowBKGC with ASLEEP and SHALLOM, we can see that the MRR, Hits@1 and Hits@3 values of ShallowBKGC on the three datasets are better than those of ASLEEP and SHALLOM. This indicates that it is beneficial to take both text and structural information into account for KGC, because the main difference between ShallowBKGC and the baselines ASLEEP and SHALLOM is that our model combines text and structural information for KGC, while ASLEEP and SHALLOM rely merely on structural information.
(3) The RTs of our model ShallowBKGC on the three benchmark datasets are significantly shorter than those of the baselines, which shows the efficiency of our model. It should be noted that, in order to minimize the impact of programming differences, we use OpenKE (Han et al., 2018) to reproduce the running times of TransE, DistMult, ComplEx and RotatE. The RTs of ConvE, SHALLOM, ASLEEP and KG-BERT are obtained from their corresponding source codes. Due to permission issues, the RTs of KG-ChatGLM-6B, KG-LLaMA-7B and KG-LLaMA-13B are missing. However, from the perspective of the number of layers and parameter scale, the RTs of these models are likely to be larger than that of KG-BERT. More formally, the time complexity of our model is $O(d_e)$ (where $d_e$ represents the dimension of entities), which is the same as that of the baselines, except for the language-based baselines. From Wang et al. (2021), we can see that the most relevant language-based baseline, KG-BERT, has time complexity $O(L_t^2 |E|^2 |R|)$, where $L_t$ is the length of the triple text, and $|E|$ and $|R|$ are the numbers of entities and relations, respectively. Additionally, the space complexity of our model is $O(L_e |E| d_{token} + |E| d_e)$, where $L_e$ is the length of the entity text and $d_{token}$ is the dimension of the entity text tokens.
(4) It is worth noting that the results of ConvE are significantly lower than those of the other models, probably because it relies on an improper pre-trained model for initialization, and is trained on the entity prediction task (i.e., given $(h, r)$ predict $t$, or given $(r, t)$ predict $h$) but tested on the relation prediction task. It has been demonstrated that initialization, hyperparameter optimization, and training strategies have significant effects on prediction performance (Demir & Ngomo, 2021; Ruffinelli, Broscheit & Gemulla, 2020). In contrast, our model ShallowBKGC is kept as simple as possible: it does not require retraining the pre-trained model BERT, a special hyperparameter optimization approach, or a complex training strategy, thus minimizing model uncertainty.

Fine-grained performance analysis
To further verify the capacity of our model from a fine-grained perspective, we plot the percentage of each relation on WN18RR in Fig. 3, and report the Hits@N and MRR performance on each relation in Figs. 4 and 5, respectively. From these figures, we can observe that: (1) From the perspective of Hits@N, there are five relations, i.e., _similar_to, _member_of_domain_region, _instance_hypernym, _member_meronym and _derivationally_related_form, whose Hits@1 values exceed 90%. Ten relations have Hits@3 values exceeding 90%. In particular, there are four relations, i.e., _similar_to, _member_of_domain_region, _verb_group and _hypernym, whose Hits@3 values reach 100%. Thus, the results are consistent with Table 4, which further demonstrates the effectiveness of our model at a fine-grained level.
(2) There are eight relations with MRR values exceeding 90%: _similar_to, _member_of_domain_usage, _member_of_domain_region, _synset_domain_topic_of, _instance_hypernym, _member_meronym, _derivationally_related_form and _hypernym. Moreover, the Hits@1 values of five of these eight relations exceed 90%. This experimental result once again demonstrates the effectiveness of our model and the consistency of the results at a fine-grained level.

Ablation study
We conduct ablation studies to provide a more detailed analysis of the effectiveness of each part of our model. The models used for comparison are ShallowBKGC-Text, which uses only the text information of entities, and ShallowBKGC-Structure, which uses only the structural information of entities. Table 6 shows the relation prediction results on the three datasets, i.e., FB15k-237, WN18RR and YAGO3-10, from which we can see that: (1) ShallowBKGC outperforms ShallowBKGC-Text and ShallowBKGC-Structure on all three datasets, indicating that considering both text and structural information is beneficial to KGC. This is also consistent with the results of the previous experiments.
(2) ShallowBKGC-Structure achieves better results than ShallowBKGC-Text. The main reason is that we set the entity text features obtained by BERT to be untrainable in order to reduce computational complexity, which sacrifices performance to a certain extent. It is worth further explaining that, from Tables 5 and 6, we can see that ShallowBKGC-Text outperforms KG-ChatGLM-6B; considering that the latter has far more parameters than the former while the experimental results are close, this also reflects the efficiency of our model.
(3) ShallowBKGC-Text has the shortest RT because it has the fewest parameters. The RTs of ShallowBKGC-Structure and ShallowBKGC are close, indicating that our model does not spend much time fusing text and structural information.

CONCLUSION
In this article, we propose a simple yet effective BERT-enhanced shallow neural network model for KGC that jointly considers text and structural information. Specifically, given an entity pair and the text information of the entities, our model first extracts the text features of the entities using BERT in a feature extraction manner, and extracts the structure features of the entities through an embedding layer. Then, the text and structure features of the head and tail entities are integrated into an entity-pair representation through an average operation and a non-linear transformation, which aims to obtain comprehensive and rich features of the entities. Finally, based on the entity-pair representation, and considering that multiple relations may exist between entities, our model calculates the probability of each relation through multi-label modeling. Experimental results on three public datasets show that our model achieves superior performance in comparison with the baseline methods.
In the future, we plan to (1) further study the performance of our model on two KGC-related tasks, i.e., triplet classification and entity prediction; (2) extend our model to temporal KGC and to link prediction in social networks; and (3) explore the possibilities and performance of shallow neural networks on other tasks that can be organized into triplets.

Table 1
Statistics of datasets. #Entities denotes the number of all unique entities. #Relations denotes the number of all unique relations. #Train, #Validation and #Test denote the numbers of triplets contained in the train, validation and test sets, respectively.
Table 3
Relation prediction results on FB15k-237. The best score is in bold, while the second best score is underlined. Results marked * are taken from Wang, Ren & Leskovec (2020) and Demir & Ngomo (2021), respectively. † denotes results from our re-implementation. RT is the abbreviation for running time.

Table 4
Relation prediction results on WN18RR. The best score is in bold, while the second best score is underlined. Results marked * are taken from Wang, Ren & Leskovec (2020) and Demir & Ngomo (2021), respectively. † denotes results from our re-implementation. RT is the abbreviation for running time.

Table 5
Relation prediction results on YAGO3-10. Results marked * are taken from Yao et al. (2023). The dash (-) denotes missing values. The best score is in bold, while the second best score is underlined.