1 Introduction

A fact in a knowledge graph (KG) is expressed as a triple (h, r, t), where r indicates the relation between the head entity h and the tail entity t. Large-scale KGs, such as WordNet [1], YAGO [2], Freebase [3] and Wikidata [4], have become a vital resource for many artificial intelligence tasks, such as question answering [5, 6] and recommendation systems [7, 8]. However, many relations have only a few observed triples. For example, about 10% of the relations in Wikidata have no more than 10 observed triples [9].

It is challenging to extract effective and representative features from only a few observed triples. To this end, few-shot relation prediction [10] has attracted broad attention; it aims at predicting whether the incomplete triple (h, ?, t) holds w.r.t. r by observing only a few triples about r.

Previous knowledge graph embedding (KGE) methods [11,12,13] require sufficient training triples to learn the representations of entities and relations, and thus cannot be directly adopted for few-shot relation prediction. Recent attempts [9, 14,15,16] introduce background information, such as neighbors and contexts of entities, to learn richer features about entities and relations in few-shot scenarios, but such background information is not always available. In practice, however, the few observed triples themselves contain useful features that have not been fully exploited.

For example, suppose the relation Capital has three observed triples (China, Capital, Beijing), (Italy, Capital, Rome), and (France, Capital, Paris). Few-shot relation prediction aims at predicting whether the incomplete triple (UK, ?, London) or (UK, ?, Liverpool) holds w.r.t. Capital by observing only these three triples. Note that the head entities {China, Italy, France} imply the property Country, and the tail entities {Beijing, Rome, Paris} imply the property City. These head and tail properties help select candidate triples whose head and tail have the properties Country and City, respectively, such as (UK, ?, London) and (UK, ?, Liverpool). Furthermore, the given triples share the same relation Capital, which helps determine the correct triple (UK, Capital, London) instead of (UK, Capital, Liverpool). Note that both the few observed triples and the correct triple involve the properties Country, City and Capital. The more similar the properties of an incomplete triple are to the properties shared by the observed triples, the more likely the incomplete triple is a fact. This means that the properties shared by the few observed triples and an incomplete triple help predict new facts in few-shot scenarios. It is thus valuable to develop a method that predicts new facts by observing only a few triples.

In this paper, we investigate how to learn the features of relevant properties to improve the accuracy of few-shot relation prediction in KGs without introducing background knowledge. Two key challenges remain: how to describe the correlations among the few observed triples, and how to learn the property features from them?

As shown in the example above, the few observed triples are strongly correlated with each other, which is useful for discerning the property features. Fortunately, the self-attention mechanism [17] allows the inputs to interact with each other and determine which inputs deserve more attention. By using the self-attention mechanism, we assign different weights to different features of the observed triples to describe their correlations, so that the property features can be highlighted. Meanwhile, it is known that the convolutional neural network (CNN) [18] is particularly effective at learning the property features of a digital image via a set of convolutional kernels, which make the network tolerant to translation of the image's features. Thus, by analogy between the set of observed triples and a digital image in terms of indivisibility and translation invariance of features, we build a feature encoder that learns the property features by combining a CNN with self-attention-based correlations.

We then learn the probability distributions of the property features to enhance their representations as well as those of the relevant relations. Next, we define a matching function that incorporates the property features into the incomplete triples, so that the model's ability to match correct relations is improved. Finally, we design a loss function that constrains the property feature space to ensure that the model can distinguish positive triples from negative ones. Specifically, by focusing on learning relation property features from the few observed triples, we propose the Convolutional Neural Network with Self-Attention Relation Prediction (CARP) model, with the following contributions:

  • We propose a method to learn property features from the few observed triples, so that the relation representation can be enhanced.

  • We design a loss function that constrains the space of property features, and present the training algorithm of our CARP model.

  • We conduct extensive experiments on NELL-One/FB-One/Wiki-One. Experimental results show that our CARP model is effective for few-shot relation prediction and outperforms the state-of-the-art competitors.

The remainder of this paper is organized as follows: Section 2 introduces related work. Section 3 presents our CARP model for few-shot relation prediction. Section 4 reports experimental results and performance studies. Section 5 concludes and discusses future work.

2 Related Work

Knowledge graph embedding-based relation prediction.

Many knowledge graph embedding methods have been successfully used for relation prediction, including distance-based and neural network-based methods. Among the former, TransE [19] interprets a relation as a translation between head-tail entity pairs. TransH [20] models relations as hyperplanes and projects head and tail entities onto the relation-specific hyperplane to form embeddings. TransAt [21] learns the translation-based embeddings, relation-related categories of entities, and relation-related attention simultaneously. Among the latter, ConvE [22] uses 2D convolution over embeddings to model the interactions between entities and relations. RESCAL [23] learns the inherent structure of dyadic relational data by tensor factorization. ComplEx [24] adopts complex-valued embeddings to effectively learn antisymmetric relations. GraIL [25] learns to predict relations over subgraph structures based on a graph neural network. TACT [26] categorizes all pairs of relations into several topological patterns and learns the importance of different patterns to facilitate link prediction. SAttLE [27] uses a large number of self-attention heads to capture the mutual information between entities and relations. However, these methods focus on learning the embeddings of entities and relations under the assumption that sufficient training examples are available, and they ignore the latent features shared among the few training triples of the same relation. It remains challenging to learn useful features from only a few training triples.

Few-shot relation prediction of KG. Several methods have been proposed for few-shot relation prediction of KGs. For example, GMatching [9] proposes a neighbor encoder to enhance entity embeddings with their local graph neighbors and performs multi-step matching to compare the incomplete triple with the few observed triples. FSRL [14] designs a recurrent autoencoder aggregation network to aggregate the representations of the few observed triples and employs a matching metric to discover new facts. FAAN [15] introduces an adaptive attention network to learn dynamic representations via the varying impacts of neighbors and adopts a stack of Transformer blocks to differentiate the contributions of the few observed triples w.r.t. different incomplete triples. MetaR [28] focuses on transferring relation-specific meta information to quickly optimize the model parameters. GANA [16] proposes a global–local framework based on a gated and attentive neighbor aggregator together with TransH to accurately integrate the semantics of neighbors and match the incomplete triple with the few observed ones. Li et al. [29] construct a Gaussian distribution for the relation of each triple in the few-shot scenario according to the distributions of its similar relations in background graphs. HiRe [30] learns and refines the representation of relations by exploiting three levels of relational information (entity-level, triple-level and context-level). RSCL [31] exploits graph contexts of triples to learn global and local relation-specific representations in few-shot scenarios. These few-shot relation prediction methods rely on introduced background information, such as neighbors and contexts of entities, to learn richer representations of entities and relations. However, such background information may not be readily available in real-world KGs. Meanwhile, the correlations implied in the few observed triples are not fully exploited. In contrast, we build a feature encoder based on a CNN with self-attention to effectively learn the relation property features from the few observed triples without introducing background information.

3 Methodology

3.1 Definitions and Problem Formalization

We first define some concepts as the basis of later discussion.

Definition 1

KG is denoted as \(\mathcal {G} = \langle \mathcal {E},\mathcal {R},\mathcal {T} \rangle\), where \(\mathcal {E}\), \(\mathcal {R}\), and \(\mathcal {T}= \{ (h, r, t)| h\in \mathcal {E}, t\in \mathcal {E}, r\in \mathcal {R} \}\) denote the sets of entities, relations, and triples in KG, respectively.

Definition 2

A reference is denoted as \(\mathcal {R}_r\), the set of few observed triples associated with relation r, where \(\mathcal {R}_r=\{ (h_i,r,t_i) | h_{i}\in \mathcal {E}, t_{i}\in \mathcal {E}, r \in \mathcal {R}, (h_i,r,t_i)\in \mathcal {T}\}^k_{i=1}\).

Definition 3

A query is denoted as \(\mathcal {Q}_r\), the set of incomplete triples to be predicted, where \(\mathcal {Q}_r=\{(h_q,r,t_q) | h_{q}\in \mathcal {E}, t_{q}\in \mathcal {E}, r \in \mathcal {R}, (h_q,r,t_q) \notin \mathcal {R}_{r} \}_{q=1}^{k}\).

Definition 4

A few-shot relation prediction task, denoted as \(\mathcal {T}_r=\{\mathcal {R}_r,\mathcal {Q}_r\}\), aims at predicting the triple in the query \(\mathcal {Q}_r\) that holds for the relation r when given the reference \(\mathcal {R}_r\).

To fulfill the few-shot relation prediction of KG, we construct a set of prediction tasks \(\mathcal {T}_{mtr}=\{\mathcal {T}_r\}\) as the training set, where each task \(\mathcal {T}_r\) corresponds to an individual relation r.

Similarly, we construct a set of new prediction tasks \(\mathcal {T}_{mte}=\{\mathcal {T}_{r^{'}}\}\) as the testing set, where the relations are unseen. Table 1 shows an example of the tasks of training and testing for few-shot relation prediction.

Table 1 Example of training and testing tasks

The problem to be solved in this paper is formulated as follows. Given a few-shot relation prediction task \(\mathcal {T}_{r}=\{\mathcal {R}_{r},\mathcal {Q}_{r}\}\), we build the feature encoder to learn the head property feature \(\textbf{z}_{h}\), tail property feature \(\textbf{z}_{t}\), and relation property feature \(\textbf{z}_{r}\) from \(\mathcal {R}_{r}\). Then, for each \((h_{q},t_{q})\) in \(\mathcal {Q}_{r}\), we use the embedding network to obtain the feature-enhanced representation \(\hat{\textbf{z}}_{r}\) of \((h_{q},t_{q})\) by incorporating \(\textbf{z}_{h}\) and \(\textbf{z}_{t}\). Finally, we calculate the similarity score between \(\hat{\textbf{z}}_{r}\) and \(\textbf{z}_{r}\) to measure whether \((h_{q},t_{q})\) holds w.r.t. r.

3.2 Framework

As shown in Fig. 1, our CARP model consists of two major components: the feature encoder for learning property features and the matching processor for matching the incomplete triples with the few observed ones.

Fig. 1
figure 1

Framework of CARP \((\oplus\) and \(\ominus\) denote the concatenation and subtraction, respectively. \(\textbf{H}\), \(\textbf{T}\), \(\textbf{z}_h\), \(\textbf{z}_t\), \(\textbf{z}_r\), \(\textbf{h}_q\), \(\textbf{t}_q\) and \(\mathbf {\hat{z}}_r\) denote the embedding of head entities, embedding of the tail entities, head property feature, tail property feature, relation property feature, embedding of \(h_q\), embedding of \(t_q\) and embedding of the entity pair \((h_q,t_q)\), respectively)

3.2.1 Feature Encoder

In this component, we aim at mining the property features shared by the head and tail entities within the given few triples of the same relation, as well as the relation property features shared by the head-tail entity pairs, to facilitate the generation and selection of correct triples. For simplicity, we use the randomly initialized matrix \(\textbf{X}\) to denote the embeddings of the head entities, tail entities and references. To describe the correlations among \(\textbf{X}\), we project \(\textbf{X}\) into a feature space by using the following linear transformation:

$$\begin{aligned} \textbf{V}=\textbf{W}_{f}\textbf{X} \end{aligned}$$
(1)

where \(\textbf{W}_{f}\) denotes the transformation matrix.

To assign different weights to different features of \(\textbf{V}\), we compute the scaled dot products between \(\textbf{V}\) and its transpose as attention weights. Then, we apply the following softmax function to obtain the attention output \(\textbf{X}_{attn}\) over \(\textbf{V}\):

$$\begin{aligned} \textbf{X}_{attn} = \mathrm{{softmax}} \left(\frac{\textbf{V}\textbf{V}^{\top }}{\sqrt{d_k}}\right)\textbf{V} \end{aligned}$$
(2)

where \(\frac{1}{\sqrt{d_k}}\) denotes the scaling factor.
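To make the computation concrete, the following is a minimal PyTorch-style sketch of Eqs. (1)–(2), assuming a single projection matrix and row-wise softmax; the class and variable names are illustrative and not taken from the released implementation.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Sketch of Eqs. (1)-(2): project X with W_f, then apply scaled
    dot-product attention of V against itself."""
    def __init__(self, dim):
        super().__init__()
        self.w_f = nn.Linear(dim, dim, bias=False)  # W_f in Eq. (1)
        self.scale = dim ** 0.5                     # sqrt(d_k) in Eq. (2)

    def forward(self, x):
        # x: (k, dim) embeddings of the k reference entities/pairs
        v = self.w_f(x)                                       # V = W_f X
        attn = torch.softmax(v @ v.t() / self.scale, dim=-1)  # scaled dot products
        return attn @ v                                       # X_attn in Eq. (2)
```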

In this way, the important property features in \(\textbf{X}\) are highlighted in \(\textbf{X}_{attn}\). We then feed \(\textbf{X}_{attn}\) into an L-layer CNN to identify the property features. Each layer of the CNN applies a convolutional kernel to learn the property features of the current feature map and an activation function to introduce nonlinearity. The l-th feature map is obtained as follows:

$$\begin{aligned} \textbf{X}^{l}=\textrm{ReLU}(\textrm{LN}(\textbf{W}^{l-1} *\textbf{X}^{l-1})+\textbf{b}^{l-1}),l=1,2,...,L \end{aligned}$$
(3)

where \(\textbf{X}^0=\textbf{X}_{attn}\); \(\textbf{X}^{l-1}\), \(\textbf{W}^{l-1}\) and \(\textbf{b}^{l-1}\) denote the feature map, convolution kernel and bias on the \((l-1)\)-th layer, respectively; \(*\), \(\textrm{LN}(\cdot )\) and \(\textrm{ReLU}(\cdot )\) denote the convolution operation, layer normalization and ReLU activation, respectively.

Since mean pooling summarizes the features present in a region of the feature map generated by a convolution layer, we apply mean pooling to the L-th feature map \(\textbf{X}^{L}\) to obtain the property feature of \(\textbf{X}\):

$$\begin{aligned} \textbf{x}=\textrm{Mean}(\textbf{X}^{L}) \end{aligned}$$
(4)
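The following sketch illustrates Eqs. (3)–(4) under assumed hyperparameters (1D convolutions, kernel size 3, L = 3 layers); the exact kernel shape and the placement of the bias relative to layer normalization are implementation choices not specified above.

```python
import torch
import torch.nn as nn

class ConvFeature(nn.Module):
    """Sketch of Eqs. (3)-(4): an L-layer convolution stack with layer
    normalization and ReLU, followed by mean pooling over the feature map.
    Kernel size, channel width and the use of Conv1d are assumptions."""
    def __init__(self, dim, num_layers=3, kernel_size=3):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
             for _ in range(num_layers)])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])

    def forward(self, x_attn):
        # x_attn: (k, dim) attention-weighted embeddings from Eq. (2)
        x = x_attn.t().unsqueeze(0)                       # (1, dim, k) for Conv1d
        for conv, norm in zip(self.convs, self.norms):
            x = conv(x)                                   # W^{l-1} * X^{l-1} (+ bias)
            x = torch.relu(norm(x.transpose(1, 2)).transpose(1, 2))  # LN + ReLU
        return x.mean(dim=-1).squeeze(0)                  # Eq. (4): x of shape (dim,)
```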

Finally, to enhance the representation of \(\textbf{x}\), we learn its probability distribution by mapping \(\textbf{x}\) to a Gaussian distribution \(p(\textbf{z}|\textbf{x})=\mathcal {N}(\varvec{\mu },\varvec{\sigma }^{2})\), where the mean \(\varvec{\mu }\) and standard deviation \(\varvec{\sigma }\) constitute the output of the multilayer perceptron (MLP), defined as:

$$\begin{aligned} \varvec{\mu }&=\textbf{W}_{\mu }\textbf{x}+\textbf{b}_{\mu } \end{aligned}$$
(5)
$$\begin{aligned} \varvec{\sigma }&=\textbf{W}_{\sigma }\textbf{x}+\textbf{b}_{\sigma } \end{aligned}$$
(6)

where {\(\textbf{W}_{\mu },\textbf{W}_{\sigma }\)} and {\(\textbf{b}_{\mu },\textbf{b}_{\sigma }\)} denote the weights and biases, respectively.

Note that sampling from a distribution is not differentiable. To address this, we use the following reparameterization strategy to sample \(\textbf{z}\) as the final representation of the property feature:

$$\begin{aligned} \textbf{z}=\varvec{\mu }+ \varvec{\epsilon } \odot \varvec{\sigma } \end{aligned}$$
(7)

where \(\varvec{\epsilon } \sim \mathcal {N}(0,\textbf{I})\), and \(\odot\) denotes the element-wise product.

To constrain the property feature space, we assume that the prior of \(\textbf{z}\) follows a standard normal distribution \(q(\textbf{z})=\mathcal {N}(0,\textbf{I})\), since, by the central limit theorem [32], the distribution of a sample mean approaches a normal distribution as the sample size grows. We then make the posterior \(p(\textbf{z}|\textbf{x})\) approximate the prior \(q(\textbf{z})\) by penalizing the Kullback–Leibler (KL) divergence between \(p(\textbf{z}|\textbf{x})\) and \(q(\textbf{z})\), computed as follows:

$$\begin{aligned} \mathcal {L}_{kl}&=KL(p(\textbf{z}|\textbf{x})||q(\textbf{z}))\\&=KL(\mathcal {N}(\varvec{\mu },\varvec{\sigma }^2)||\mathcal {N}(0,\textbf{I}))\\&=\int \frac{1}{\sqrt{2 \pi \varvec{\sigma }^{2}}}e^{-\frac{(\textbf{x}-\varvec{\mu })^{2}}{2 \varvec{\sigma }^{2}}}\log \frac{e^{-(\textbf{x}-\varvec{\mu })^{2}/2 \varvec{\sigma }^{2}}/\sqrt{2 \pi \varvec{\sigma }^{2}}}{e^{-\textbf{x}^{2}/2}/\sqrt{2\pi }}\,d\textbf{x}\\&=\int \frac{1}{\sqrt{2\pi \varvec{\sigma }^{2}}}e^{-\frac{(\textbf{x}-\varvec{\mu })^{2}}{2 \varvec{\sigma }^{2}}} \log \left\{ \frac{1}{\sqrt{\varvec{\sigma }^{2}}}\, e^{\frac{1}{2}\left[\textbf{x}^{2}-(\textbf{x}-\varvec{\mu })^{2}/\varvec{\sigma }^{2}\right]}\right\} d\textbf{x}\\&=\frac{1}{2}\int \frac{1}{\sqrt{2\pi \varvec{\sigma }^{2}}}e^{-\frac{(\textbf{x}-\varvec{\mu })^{2}}{2 \varvec{\sigma }^{2}}}\left[-\log \varvec{\sigma }^{2}+\textbf{x}^{2}-\frac{(\textbf{x}-\varvec{\mu })^{2}}{\varvec{\sigma }^{2}}\right]d\textbf{x}\\&=\frac{1}{2}\left(\varvec{\mu }^{2}+\varvec{\sigma }^{2}-\log \varvec{\sigma }^{2}-1\right) \end{aligned}$$
(8)
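A compact sketch of the Gaussian head in Eqs. (5)–(8) is given below. Predicting \(\varvec{\sigma }\) directly with a linear layer mirrors Eq. (6), although many implementations predict \(\log \varvec{\sigma }^{2}\) for numerical stability; the small constant inside the logarithm is our own safeguard, and all names are illustrative.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Sketch of Eqs. (5)-(8): produce mu and sigma, sample z with the
    reparameterization trick, and return the closed-form KL divergence
    to N(0, I)."""
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Linear(dim, dim)      # W_mu, b_mu in Eq. (5)
        self.sigma = nn.Linear(dim, dim)   # W_sigma, b_sigma in Eq. (6)

    def forward(self, x):
        mu, sigma = self.mu(x), self.sigma(x)
        eps = torch.randn_like(sigma)                      # eps ~ N(0, I)
        z = mu + eps * sigma                               # Eq. (7)
        # Eq. (8); the small constant is our own safeguard against log(0)
        kl = 0.5 * (mu ** 2 + sigma ** 2 - torch.log(sigma ** 2 + 1e-8) - 1).sum()
        return z, kl
```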

3.2.2 Matching Processor

Given a few-shot relation prediction task \(\mathcal {T}_{r}=\{\mathcal {R}_{r},\mathcal {Q}_{r}\}\), we first separate the reference \(\mathcal {R}_{r}\) into two parts, the set of head entities, and the set of tail entities.

We randomly initialize the matrices \(\textbf{H}\) and \(\textbf{T}\) to denote the embeddings of the head and tail entities, respectively. We feed \(\textbf{H}\) and \(\textbf{T}\) into the feature encoder to obtain the head property feature \(\textbf{z}_{h}\) and the tail property feature \(\textbf{z}_{t}\) by Eq. (7), together with the KL losses \(\mathcal {L}_{kl}^{h}\) and \(\mathcal {L}_{kl}^{t}\) of the head and tail property features by Eq. (8). We also obtain the head feature map \({\textbf {H}}^{L}\) of \(\textbf{H}\) and the tail feature map \({\textbf {T}}^{L}\) of \(\textbf{T}\) by Eq. (3). Next, we concatenate \(\textbf{H}\), \({\textbf {H}}^{L}\), \(\textbf{T}\) and \({\textbf {T}}^{L}\) as the input of the feature encoder to obtain the relation property feature \(\textbf{z}_{r}\) by Eq. (7) and its KL loss \(\mathcal {L}_{kl}^{r}\) by Eq. (8).

Note that there exist latent correlations between the reference \(\mathcal {R}_{r}\) and the query \(\mathcal {Q}_{r}\) within the same few-shot relation prediction task \(\mathcal {T}_{r}\). To incorporate the correlations between \(h_{q}\) and \(t_{q}\) into \(\textbf{q}_{h_{q},t_{q}}\), we build an MLP with two linear transformations and the following activation function:

$$\begin{aligned} \textbf{q}_{h_{q},t_{q}}=\textbf{W}_2 \cdot \textrm{ReLU}(\textbf{W}_1 \cdot (\textbf{h}_q \oplus \textbf{t}_q)+\textbf{b}_1)+\textbf{b}_2 \end{aligned}$$
(9)

where \(\textbf{h}_{q}\) and \(\textbf{t}_{q}\) denote the embeddings of \(h_{q}\) and \(t_{q}\), respectively; {\(\textbf{W}_{1}\), \(\textbf{W}_{2}\)} and {\(\textbf{b}_{1}\), \(\textbf{b}_{2}\)} denote the weights and biases, respectively; \(\oplus\) and \(\textrm{ReLU}(\cdot )\) denote the concatenation and activation function, respectively.

Similarly, we incorporate the correlations between \(\textbf{z}_{h}\) and \(\textbf{z}_{t}\) into \(\textbf{z}_{h,t}\) by Eq. (9).
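A minimal sketch of the two-layer MLP in Eq. (9) is shown below; as noted above, the same module can be reused for the pair \((\textbf{z}_{h}, \textbf{z}_{t})\) to produce \(\textbf{z}_{h,t}\). Dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class PairEncoder(nn.Module):
    """Sketch of Eq. (9): a two-layer MLP over the concatenation of two
    vectors, used for (h_q, t_q) and reused for (z_h, z_t)."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(2 * dim, dim)  # W_1, b_1
        self.fc2 = nn.Linear(dim, dim)      # W_2, b_2

    def forward(self, a, b):
        return self.fc2(torch.relu(self.fc1(torch.cat([a, b], dim=-1))))
```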

To incorporate \(\textbf{z}_{h,t}\) into \(\textbf{q}_{h_{q},t_{q}}\), we build an embedding network by applying a linear transformation with the following activation function to \(\textbf{z}_{h,t}\) and \(\textbf{q}_{h_{q},t_{q}}\):

$$\begin{aligned} \hat{\textbf{z}}_{r}=\textbf{W}_{o} \cdot \tanh (\textbf{W}_{h}\textbf{z}_{h,t} + \textbf{W}_{i}\textbf{q}_{h_{q},t_{q}}) \end{aligned}$$
(10)

where \(\textbf{z}_{h,t}\) denotes the representation of \((z_{h},z_{t})\), \(\textbf{q}_{h_{q},t_{q}}\) denotes the representation of \((h_{q},t_{q})\), {\(\textbf{W}_{o}\), \(\textbf{W}_{h}\), \(\textbf{W}_{i}\)} denotes the weights, and \(\tanh (\cdot )\) denotes the activation function.

Next, to match \((h_{q},t_{q})\) with \(\mathcal {R}_{r}\), the matching processor casts the matching problem as Euclidean distance-based clustering, since the independent few-shot relation prediction tasks can be viewed as clusters. To this end, we treat each few-shot relation prediction task as a cluster and take \(\textbf{z}_{r}\) as the cluster center. Based on \(\textbf{z}_{r}\) and \(\hat{\textbf{z}}_{r}\), we use the following Euclidean distance to measure how far \(\hat{\textbf{z}}_{r}\) is from \(\textbf{z}_{r}\):

$$\begin{aligned} f_{r}(h_{q},t_{q})=\Vert \hat{\textbf{z}}_{r}-\textbf{z}_{r} \Vert ^{2}_{2} \end{aligned}$$
(11)

where \(\Vert \cdot \Vert ^{2}_{2}\) denotes the squared \(L_{2}\) norm.

The smaller \(f_{r}(h_{q},t_{q})\) is, the more likely \(\hat{\textbf{z}}_{r}\) belongs to the cluster of \(\textbf{z}_{r}\), that is, the more likely \((h_{q},t_{q})\) holds w.r.t. the relation r.
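The embedding network of Eq. (10) and the distance of Eq. (11) can be sketched together as follows; whether the projections carry biases is not specified above, so the bias-free form here is an assumption.

```python
import torch
import torch.nn as nn

class MatchingScore(nn.Module):
    """Sketch of Eqs. (10)-(11): fuse z_{h,t} with the query representation
    q_{h_q,t_q}, then score the query by its squared Euclidean distance to
    the relation property feature z_r (smaller is better)."""
    def __init__(self, dim):
        super().__init__()
        self.w_h = nn.Linear(dim, dim, bias=False)  # W_h
        self.w_i = nn.Linear(dim, dim, bias=False)  # W_i
        self.w_o = nn.Linear(dim, dim, bias=False)  # W_o

    def forward(self, z_ht, q, z_r):
        z_hat = self.w_o(torch.tanh(self.w_h(z_ht) + self.w_i(q)))  # Eq. (10)
        return ((z_hat - z_r) ** 2).sum(dim=-1)                     # Eq. (11)
```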

3.3 Training Algorithm

For the relation r, we randomly sample k triples as the reference \(\mathcal {R}_{r}=\{(h_{i},r,t_{i})|(h_{i},r,t_{i}) \in \mathcal {G} \}_{i=1}^{k}\). The remaining triples \(\mathcal {Q}_{r}=\{(h_{q},r,t_{q})|(h_{q},r,t_{q}) \in \mathcal {G}, (h_{q},r,t_{q}) \notin \mathcal {R}_{r}\}\) are regarded as positive triples. Moreover, we construct a set of negative triples \(\mathcal {N}_{r}=\{(h_{q},r,t^{-}_{q})|(h_{q},r,t^{-}_{q}) \notin \mathcal {G}\}\) by corrupting the tail entities.

To distinguish positive triples from negative ones, and to ensure that the matching score (Eq. (11)) between a positive triple and \(\mathcal {R}_{r}\) is at least \(\gamma\) lower than that between the corresponding negative triple and \(\mathcal {R}_{r}\), we minimize the following hinge loss [33] over \(\mathcal {Q}_{r}\) and \(\mathcal {N}_{r}\):

$$\begin{aligned} \mathcal {L}_{q}=\sum \limits _{r}\sum \limits _{(h_{q},t_{q})\in \mathcal {Q}_{r}}\sum \limits _{(h_{q},t^{-}_{q})\in \mathcal {N}_{r}}[{f_{r}(h_{q},t_{q})+\gamma - f_{r}(h_{q},t^{-}_{q})}]_{+} \end{aligned}$$
(12)

where \([x]_{+}=\max (0,x)\), \(\gamma\) denotes the margin, \(f_{r}(h_{q},t_{q})\) denotes the matching score between \((h_q,r,t_q)\) and \(\mathcal {R}_{r}\), and \(f_{r}(h_{q},t^{-}_{q})\) denotes the matching score between \((h_q,r,t_{q}^{-})\) and \(\mathcal {R}_{r}\).
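A one-line sketch of Eq. (12) is given below, assuming the positive and negative scores have already been aligned pairwise; the default margin value is only a placeholder.

```python
import torch

def hinge_loss(pos_scores, neg_scores, gamma=1.0):
    """Sketch of Eq. (12): pos_scores and neg_scores hold f_r(h_q, t_q) and
    f_r(h_q, t_q^-) for pairwise-aligned positive/negative queries; gamma is
    the margin (the default value here is a placeholder)."""
    return torch.relu(pos_scores + gamma - neg_scores).sum()
```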

Meanwhile, we minimize the KL losses \(\mathcal {L}_{kl}^{h}\), \(\mathcal {L}_{kl}^{t}\) and \(\mathcal {L}_{kl}^{r}\) of the head, tail and relation property features obtained by Eq. (8) to constrain the space of the property features, since the smaller the KL loss, the closer the probability distribution of the property features is to the standard normal distribution. Overall, the loss function of our CARP model is given as follows:

$$\begin{aligned} \mathcal {L}=\mathcal {L}_{q}+ \mathcal {L}_{kl}^{h}+\mathcal {L}_{kl}^{t}+\mathcal {L}_{kl}^{r} \end{aligned}$$
(13)

The above procedure is summarized in Algorithm 1, whose time complexity is \(O(n \times |\mathcal {T}_{mtr}| \times |\mathbf {\Theta }|)\).

figure a
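Since Algorithm 1 appears here only as a figure, the following hedged sketch shows how one optimization step could combine Eqs. (12)–(13); the helper signature and the way scores are batched are our own assumptions, not the paper's interface.

```python
import torch

def train_step(pos_scores, neg_scores, kl_losses, optimizer, gamma=1.0):
    """Hedged sketch of one optimization step: pos_scores/neg_scores are the
    distances f_r for positive and negative query pairs of the sampled tasks,
    kl_losses is [L_kl^h, L_kl^t, L_kl^r]; the interface is illustrative."""
    loss = torch.relu(pos_scores + gamma - neg_scores).sum() + sum(kl_losses)  # Eq. (13)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```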

4 Experiments

In this section, we present experimental results on three real-life datasets to evaluate our CARP method. We first introduce the experimental settings, and then conduct four sets of experiments: (1) MRR/Hits@1/Hits@5/Hits@10 on 3/5-shot relation prediction, (2) impacts of few-shot size, (3) ablation study, and (4) case study to evaluate our method compared with existing methods.

4.1 Experiment Settings

Datasets and Evaluation Metrics. Our experiments were conducted on three KG datasets, NELL-One, FB-One, and Wiki-One, where NELL-One and Wiki-One were constructed by Xiong et al. [9]. NELL-One is based on NELL, a system that continuously collects structured knowledge from the web via an intelligent agent. Wiki-One is based on Wikidata, a free general-purpose structured knowledge base of encyclopedic knowledge. Furthermore, we followed a similar process to build another dataset from Freebase, a large collaborative knowledge base of social knowledge. Specifically, we first removed the inverse relations and then selected the relations with more than 50 but fewer than 500 triples for few-shot relation prediction. Each few-shot relation prediction task consists of the triples of the same relation. There are 67, 131 and 183 few-shot relation prediction tasks on NELL-One, FB-One and Wiki-One, respectively. Following the original settings [9], we split the few-shot relation prediction tasks into training/validation/testing sets of 51/5/11, 98/11/22, and 133/16/34 tasks on NELL-One, FB-One, and Wiki-One, respectively. The statistics of the datasets are shown in Table 2.

To evaluate the accuracy of our CARP model, we used two common ranking metrics: (1) Mean Reciprocal Rank (MRR), the average of the reciprocal ranks of the correct triples; (2) Hits@k (k = 1, 5, 10), the proportion of correct triples ranked in the top k. The higher the values of MRR and Hits@k, the better the relation prediction performance.
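For concreteness, a small sketch of how MRR and Hits@k can be computed from the ranks of the correct triples is given below; the function name and interface are illustrative.

```python
import torch

def mrr_hits(ranks, ks=(1, 5, 10)):
    """Sketch of the evaluation metrics: `ranks` holds the rank of the correct
    tail entity for each test query (1 is best)."""
    ranks = torch.as_tensor(ranks, dtype=torch.float)
    mrr = (1.0 / ranks).mean().item()
    hits = {k: (ranks <= k).float().mean().item() for k in ks}
    return mrr, hits
```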

Table 2 Statistics of datasets

Comparison Methods. We compared our proposed method with two categories of methods:

  • The embedding-based methods include TransE [19], TransH [20], RESCAL [23], ComplEx [24], GraIL [25] and TACT [26].

  • The few-shot relation prediction methods include GMatching [9], FAAN [15], FSRL [14], MetaR [28] and GANA [16].

Implementation. To implement few-shot relation prediction with the KGE methods, we used all the triples in the training set, together with the reference triples of the validation and testing sets, as training triples. For TransE/TransH/ComplEx/RESCAL, we used the open-source code released by [34]. For GraIL/TACT/GMatching/FAAN/FSRL/MetaR/GANA, we used the code released by their authors. For a fair comparison, the embedding dimension was set to 100, 100, and 50 for NELL-One, FB-One, and Wiki-One, respectively, following [9]. During the training of CARP, we used Adam [35] with a learning rate of 0.001 to update the parameters. In all experiments, the batch size was set to 64, and the number of training epochs was set to 200 for NELL-One, 300 for FB-One, and 400 for Wiki-One.

4.2 Experimental Results

4.2.1 Exp-1: MRR/Hits@1/Hits@5/Hits@10 on 3/5-shot Relation Prediction.

MRR/Hits@1/Hits@5/Hits@10 of 3/5-shot relation prediction on NELL-One/FB-One/Wiki-One are shown in Table 3, which tell us that:

  • On NELL-One, for the 3-shot relation prediction, our model improves MRR, Hits@1, Hits@5, and Hits@10 by 131.19%, 178.52%, 99.63%, and 47.25% over the second-highest model, respectively. For the 5-shot relation prediction, our model improves MRR, Hits@1, Hits@5, and Hits@10 by 106.52%, 154.49%, 83.90%, and 44.41% over the second-highest model, respectively.

  • On FB-One, for the 3-shot relation prediction, our model improves MRR, Hits@1, Hits@5, and Hits@10 by 60.86%, 71.84%, 40.54%, and 24.33% over the second-highest model, respectively. For the 5-shot relation prediction, our model improves MRR, Hits@1, Hits@5, and Hits@10 by 72%, 87.32%, 49.9%, and 31.34% over the second-highest model, respectively.

  • On Wiki-One, for the 3-shot relation prediction, our model improves MRR, Hits@1, Hits@5, and Hits@10 by 90.30%, 99.56%, 80.43%, and 69.65% over the second-highest model, respectively. For the 5-shot relation prediction, our model improves MRR, Hits@1, Hits@5, and Hits@10 by 81.49%, 97.4%, 80.37%, and 68.79% over the second-highest model, respectively.

In summary, our CARP model improves MRR, Hits@1, Hits@5, and Hits@10 by 90%, 124%, 70%, and 48%, respectively, on average over the second-highest comparison model on the three real-world datasets. This demonstrates that: (1) Our CARP model adapts well to different datasets, while the comparison methods perform unstably across datasets. For example, FSRL performs better on FB-One but worse on Wiki-One. (2) Our CARP model learns more useful representations of the entities by mining the property features rather than relying on background information in few-shot scenarios.

Table 3 MRR/Hits@1/Hits@5/Hits@10 on 3/5-shot relation prediction (Bold numbers denote the best results)

4.2.2 Exp-2: Impacts of Few-shot Size

To evaluate the impacts of few-shot size k, we set \(k=1,3,5,7\), and tested the MRR with different k. The results are reported in Fig. 2, which tell us that:

  • Our CARP model outperforms the comparison models for all values of k on NELL-One/FB-One/Wiki-One, demonstrating the effectiveness of our model for few-shot relation prediction.

  • MRR increases slightly with the increase in k, indicating that the larger the reference, the richer the information learned by our CARP model.

Fig. 2
figure 2

Impacts of few-shot size k

4.2.3 Exp-3: Ablation Study

To evaluate the contributions of the feature encoder and matching processor, we conducted ablation studies with two settings. Firstly, to test the effectiveness of the feature encoder, we replaced the feature encoder module with a mean-pooling layer over the reference, denoted as AS_1. Secondly, to test how much the property feature learned with the feature encoder contributes to the query, we replaced the property feature with the random feature as the input of the embedding network, denoted as AS_2. The results are reported in Table 4, which tell us that:

  • Our CARP model outperforms the variant AS_1, indicating that the feature encoder of our model could learn more effective and representative features from the given reference. Specifically, for the 3-shot relation prediction, our model improves MRR, Hits@1, Hits@5 and Hits@10 by 17.34%, 16.57%, 18.97% and 19.38% over AS_1 on NELL-One, 9.83%, 9.93%, 9.74% and 10.94% over AS_1 on FB-One, 21.14%, 23.64%, 18.81% and 19.31% over AS_1 on Wiki-One. For the 5-shot relation prediction, our model improves MRR, Hits@1, Hits@5 and Hits@10 by 40.95%, 41.20%, 41.69% and 42.89% over AS_1 on NELL-One, 30.55%, 34.85%, 26.35% and 28.40% over AS_1 on FB-One, 20.28%, 23.24%, 17.21% and 17.51% over AS_1 on Wiki-One.

  • Our CARP model outperforms the variant AS_2, demonstrating that the feature encoder contributes largely to the query. Specifically, for the 3-shot relation prediction, our model improves MRR, Hits@1, Hits@5 and Hits@10 by 74.25%, 81.22%, 68.14% and 67.50% over AS_2 on NELL-One, 129.14%, 153.39%, 114.60% and 118.75% over AS_2 on FB-One, 70.57%, 76.36%, 65.06% and 70.57% over AS_2 on Wiki-One. For the 5-shot relation prediction, our model improves MRR, Hits@1, Hits@5 and Hits@10 by 93.09%, 98.60%, 89.08% and 89.86% over AS_2 on NELL-One, 169.80%, 198.17%, 149.66% and 153.74% over AS_2 on FB-One, 70.57%, 89.21%, 78.15% and 78.05% over AS_2 on Wiki-One.

In summary, we can see that both the feature encoder and matching processor play crucial roles in our CARP model. This also verifies our assumption that the property features learned from the few observed triples play a crucial role in few-shot relation prediction.

Table 4 MRR/Hits@1/Hits@5/Hits@10 of model variants for 3/5-shot relation prediction (Bold numbers denote the best results)

4.2.4 Exp-4: Case Study

We conducted case studies to evaluate the MRR of each few-shot relation prediction task on NELL-One/FB-One/Wiki-One. The results are reported in Fig. 3, which tell us that:

  • Our CARP model has low variance on NELL-One/FB-One/Wiki-One, while the comparison methods have high variance, demonstrating the stability of our method under different few-shot relation prediction tasks.

  • Our CARP model achieves the best MRR in 79% of different few-shot relation prediction tasks, suggesting that our method is robust for different few-shot relation prediction tasks.

Fig. 3
figure 3

MRR on different relations

5 Conclusion and Future Work

In this paper, we have proposed the CARP model to predict new facts with only a few observed triples. By focusing on learning relation property features from the few observed triples rather than introducing background information, CARP avoids introducing noise. It not only enhances the representation of relations, but also facilitates predicting new facts in few-shot scenarios.

In the future, we will consider learning more valuable features about relations in few-shot scenarios. Besides, we will consider shuffling the order of the triples in the reference as a data augmentation strategy to enhance the representations of entities and relations.