1 Introduction

A fact in a knowledge graph (KG) is expressed as a triple (h, r, t), where r indicates the relation between the head entity h and the tail entity t. Large-scale KGs, such as WordNet [1], YAGO [2], Freebase [3] and Wikidata [4], have become a vital resource for many artificial intelligence tasks, such as question answering [5, 6] and recommendation systems [7, 8]. However, many relations have only a few observed triples. For example, about 10% of the relations in Wikidata have no more than 10 observed triples [9].

It is challenging to extract effective and representative features from only a few observed triples. To this end, few-shot relation prediction [10] has attracted broad attention; it aims at predicting whether the incomplete triple (h, ?, t) holds w.r.t. r by observing only a few triples about r.

Previous knowledge graph embedding (KGE) methods [11,12,13] require sufficient training triples to learn the representations of entities and relations, and thus cannot be directly adopted for few-shot relation prediction. Recent attempts [9, 14,15,16] introduce background information, such as neighbors and contexts of entities, to learn richer features about entities and relations in few-shot scenarios, but such background information is not always available. In practice, however, the few observed triples themselves contain useful features that have not been fully exploited.

For example, suppose the relation Capital has three observed triples (China, Capital, Beijing), (Italy, Capital, Rome), and (France, Capital, Paris). Few-shot relation prediction aims at predicting whether the incomplete triple (UK, ?, London) or (UK, ?, Liverpool) holds w.r.t. Capital by observing only these three triples. Note that the head entities {China, Italy, France} imply the property Country, and the tail entities {Beijing, Rome, Paris} imply the property City. These head and tail properties help select candidate triples whose head and tail have the properties Country and City, respectively, such as (UK, ?, London) and (UK, ?, Liverpool). Furthermore, the given triples share the same relation Capital, which helps determine the correct triple (UK, Capital, London) instead of (UK, Capital, Liverpool). Note that both the few observed triples and the correct triple involve the properties Country, City and Capital. The more similar the properties of an incomplete triple are to the properties shared by the observed triples, the more likely the incomplete triple is a fact. This means that the properties shared by the few observed triples and an incomplete triple help predict new facts in few-shot scenarios. It is thus valuable to develop a method that predicts new facts by observing only a few triples.

In this paper, we investigate how to learn the features of relevant properties to improve the accuracy of few-shot relation prediction in KGs without introducing background knowledge. Two key challenges remain: how to describe the correlations among the few observed triples, and how to learn the property features from them?

As shown in the example above, the few observed triples are strongly correlated with each other, which is useful for discerning the property features. Fortunately, the self-attention mechanism [17] allows the inputs to interact with each other and determine which inputs deserve more attention. By using the self-attention mechanism, we assign different weights to different features of the observed triples to describe their correlations, so that the property features can be highlighted. Meanwhile, it is known that the convolutional neural network (CNN) [18] is particularly effective at learning the property features of a digital image via a set of convolutional kernels, which make the network tolerant to translation of the image's features. Thus, by analogy between the set of observed triples and a digital image in terms of indivisibility and translation invariance of features, we build a feature encoder that learns the property features by combining a CNN with self-attention-based correlations.

We then learn the probability distributions of the property features to enhance their representations as well as those of the relevant relations. Next, we define a matching function that incorporates the property features into the incomplete triples, so that the model's ability to match correct relations is improved. Finally, we design a loss function that constrains the property feature space to ensure that the model can distinguish positive triples from negative ones. Specifically, by focusing on learning relation property features from the few observed triples, we propose the Convolutional Neural Network with Self-Attention Relation Prediction (CARP) model, with the following contributions:

  • We propose a method to learn property features from the few observed triples, so that the relation representation can be enhanced.

  • We design a loss function that constrains the space of property features, and present the training algorithm of our CARP model.

  • We conduct extensive experiments on NELL-One/FB-One/Wiki-One. Experimental results show that our CARP model is effective for few-shot relation prediction and outperforms the state-of-the-art competitors.

The remainder of this paper is organized as follows: Section 2 introduces related work. Section 3 presents our CARP model for few-shot relation prediction. Section 4 reports experimental results and performance studies. Section 5 concludes and discusses future work.

2 Related Work

Knowledge graph embedding-based relation prediction.

Many knowledge graph embedding methods have been successfully used for relation prediction, including distance-based and neural network-based methods. Among the former, TransE [19] interprets a relation as a translation between head-tail entity pairs. TransH [20] models relations as hyperplanes and projects head and tail entities onto the relation-specific hyperplane to form embeddings. TransAt [21] learns the translation-based embeddings, relation-related categories of entities, and relation-related attention simultaneously. Among the latter, ConvE [22] uses 2D convolution over embeddings to model the interactions between entities and relations. RESCAL [23] learns the inherent structure of dyadic relational data by tensor factorization. ComplEx [24] adopts complex-valued embeddings to effectively learn antisymmetric relations. GraIL [25] learns to predict relations over subgraph structures based on a graph neural network. TACT [26] categorizes all pairs of relations into several topological patterns and learns the importance of different patterns to facilitate link prediction. SAttLE [27] uses a large number of self-attention heads to capture the mutual information between entities and relations. However, these methods focus on learning the embeddings of entities and relations under the assumption that sufficient training examples are available, and they ignore the latent features shared among the few training triples of the same relation. It remains challenging to learn useful features from only a few training triples.

Few-shot relation prediction of KG. Several methods have been proposed for few-shot relation prediction of KGs. For example, GMatching [9] proposes a neighbor encoder to enhance entity embeddings with their local graph neighbors and performs multi-step matching to compare the incomplete triple with the few observed triples. FSRL [14] designs a recurrent autoencoder aggregation network to aggregate the representations of the few observed triples and employs a matching metric to discover new facts. FAAN [15] introduces an adaptive attention network to learn dynamic representations via the varying impacts of neighbors and adopts a stack of Transformer blocks to differentiate the contributions of the few observed triples w.r.t. different incomplete triples. MetaR [28] focuses on transferring relation-specific meta information to quickly optimize the model parameters. GANA [16] proposes a global–local framework based on a gated and attentive neighbor aggregator together with TransH to accurately integrate the semantics of neighbors and match the incomplete triple with the few observed ones. Li et al. [29] construct a Gaussian distribution for the relation of each triple in the few-shot scenario according to the distributions of its similar relations in background graphs. HiRe [30] learns and refines the representation of relations by exploiting three levels of relational information (entity-level, triple-level and context-level). RSCL [31] exploits graph contexts of triples to learn global and local relation-specific representations in few-shot scenarios. These few-shot relation prediction methods rely on introduced background information, such as neighbors and contexts of entities, to learn richer representations of entities and relations. However, such background information may not be readily available in real-world KGs. Meanwhile, the correlations implied in the few observed triples are not fully exploited. In contrast, we build a feature encoder based on a CNN with self-attention to effectively learn the relation property features from the few observed triples without introducing background information.

3 Methodology

3.1 Definitions and Problem Formalization

We first define some concepts as the basis of later discussion.

Definition 1

KG is denoted as \(\mathcal {G} = \langle \mathcal {E},\mathcal {R},\mathcal {T} \rangle\), where \(\mathcal {E}\), \(\mathcal {R}\), and \(\mathcal {T}= \{ (h, r, t)| h\in \mathcal {E}, t\in \mathcal {E}, r\in \mathcal {R} \}\) denote the sets of entities, relations, and triples in KG, respectively.

Definition 2

A reference is denoted as \(\mathcal {R}_r\), the set of few observed triples associated with relation r, where \(\mathcal {R}_r=\{ (h_i,r,t_i) | h_{i}\in \mathcal {E}, t_{i}\in \mathcal {E}, r \in \mathcal {R}, (h_i,r,t_i)\in \mathcal {T}\}^k_{i=1}\).

Definition 3

A query is denoted as \(\mathcal {Q}_r\), the set of incomplete triples to be predicted, where \(\mathcal {Q}_r=\{(h_q,r,t_q) | h_{q}\in \mathcal {E}, t_{q}\in \mathcal {E}, r \in \mathcal {R}, (h_q,r,t_q) \notin \mathcal {R}_{r} \}_{q=1}^{k}\).

Definition 4

A few-shot relation prediction task, denoted as \(\mathcal {T}_r=\{\mathcal {R}_r,\mathcal {Q}_r\}\), aims at predicting the triple in the query \(\mathcal {Q}_r\) that holds for the relation r when given the reference \(\mathcal {R}_r\).

To fulfill the few-shot relation prediction of KG, we construct a set of prediction tasks \(\mathcal {T}_{mtr}=\{\mathcal {T}_r\}\) as the training set, where each task \(\mathcal {T}_r\) corresponds to an individual relation r.

Similarly, we construct a set of new prediction tasks \(\mathcal {T}_{mte}=\{\mathcal {T}_{r^{'}}\}\) as the testing set, where the relations are unseen. Table 1 shows an example of the tasks of training and testing for few-shot relation prediction.

Table 1 Example of training and testing tasks

The problem to be solved in this paper is formulated as follows. Given a few-shot relation prediction task \(\mathcal {T}_{r}=\{\mathcal {R}_{r},\mathcal {Q}_{r}\}\), we build the feature encoder to learn the head property feature \(\textbf{z}_{h}\), tail property feature \(\textbf{z}_{t}\), and relation property feature \(\textbf{z}_{r}\) from \(\mathcal {R}_{r}\). Then, for each \((h_{q},t_{q})\) in \(\mathcal {Q}_{r}\), we use the embedding network to obtain the feature-enhanced representation \(\hat{\textbf{z}}_{r}\) of \((h_{q},t_{q})\) by incorporating \(\textbf{z}_{h}\) and \(\textbf{z}_{t}\). Finally, we calculate the similarity score between \(\hat{\textbf{z}}_{r}\) and \(\textbf{z}_{r}\) to measure whether \((h_{q},t_{q})\) holds w.r.t. r.

3.2 Framework

As shown in Fig. 1, our CARP model consists of two major components: the feature encoder for learning property features and the matching processor for matching the incomplete triples with the few observed ones.

Fig. 1
figure 1

Framework of CARP \((\oplus\) and \(\ominus\) denote the concatenation and subtraction, respectively. \(\textbf{H}\), \(\textbf{T}\), \(\textbf{z}_h\), \(\textbf{z}_t\), \(\textbf{z}_r\), \(\textbf{h}_q\), \(\textbf{t}_q\) and \(\mathbf {\hat{z}}_r\) denote the embedding of head entities, embedding of the tail entities, head property feature, tail property feature, relation property feature, embedding of \(h_q\), embedding of \(t_q\) and embedding of the entity pair \((h_q,t_q)\), respectively)

3.2.1 Feature Encoder

In this component, we aim at mining the property features shared by the head and tail entities within the given few triples of the same relation, as well as the relation property features shared by the head-tail entity pairs, to facilitate the generation and selection of correct triples. For simplicity, we use the randomly initialized matrix \(\textbf{X}\) to denote the embeddings of the head entities, tail entities and references. To describe the correlations among \(\textbf{X}\), we project \(\textbf{X}\) into a feature space by using the following linear transformation:

$$\begin{aligned} \textbf{V}=\textbf{W}_{f}\textbf{X} \end{aligned}$$
(1)

where \(\textbf{W}_{f}\) denotes the transformation matrix.

To assign different weights to different features of \(\textbf{V}\), we compute the scaled dot products between \(\textbf{V}\) and its transpose as attention weights. Then, we apply the following softmax function to obtain the attention output \(\textbf{X}_{attn}\) over \(\textbf{V}\):

$$\begin{aligned} \textbf{X}_{attn} = \mathrm{{softmax}} \left(\frac{\textbf{V}\textbf{V}^{\top }}{\sqrt{d_k}}\right)\textbf{V} \end{aligned}$$
(2)

where \(\frac{1}{\sqrt{d_k}}\) denotes the scaling factor.
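To make the computation concrete, the following is a minimal PyTorch-style sketch of Eqs. (1)–(2), assuming a single projection matrix and row-wise softmax; the class and variable names are illustrative and not taken from the released implementation.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Sketch of Eqs. (1)-(2): project X with W_f, then apply scaled
    dot-product attention of V against itself."""
    def __init__(self, dim):
        super().__init__()
        self.w_f = nn.Linear(dim, dim, bias=False)  # W_f in Eq. (1)
        self.scale = dim ** 0.5                     # sqrt(d_k) in Eq. (2)

    def forward(self, x):
        # x: (k, dim) embeddings of the k reference entities/pairs
        v = self.w_f(x)                                       # V = W_f X
        attn = torch.softmax(v @ v.t() / self.scale, dim=-1)  # scaled dot products
        return attn @ v                                       # X_attn in Eq. (2)
```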

In this way, the important property features in \(\textbf{X}\) are highlighted in \(\textbf{X}_{attn}\). We then feed \(\textbf{X}_{attn}\) into an L-layer CNN to identify the property features. Each layer of the CNN applies a convolutional kernel to learn the property features of the current feature map and an activation function to introduce nonlinearity. The l-th feature map is obtained as follows:

$$\begin{aligned} \textbf{X}^{l}=\textrm{ReLU}(\textrm{LN}(\textbf{W}^{l-1} *\textbf{X}^{l-1})+\textbf{b}^{l-1}),l=1,2,...,L \end{aligned}$$
(3)

where \(\textbf{X}^0=\textbf{X}_{attn}\); \(\textbf{X}^{l-1}\), \(\textbf{W}^{l-1}\) and \(\textbf{b}^{l-1}\) denote the feature map, convolution kernel and bias on the \((l-1)\)-th layer, respectively; \(*\), \(\textrm{LN}(\cdot )\) and \(\textrm{ReLU}(\cdot )\) denote the convolution operation, layer normalization and ReLU activation, respectively.

Since mean pooling summarizes the features present in a region of the feature map generated by a convolution layer, we apply mean pooling to the L-th feature map \(\textbf{X}^{L}\) to obtain the property feature of \(\textbf{X}\):

$$\begin{aligned} \textbf{x}=\textrm{Mean}(\textbf{X}^{L}) \end{aligned}$$
(4)
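The following sketch illustrates Eqs. (3)–(4) under assumed hyperparameters (1D convolutions, kernel size 3, L = 3 layers); the exact kernel shape and the placement of the bias relative to layer normalization are implementation choices not specified above.

```python
import torch
import torch.nn as nn

class ConvFeature(nn.Module):
    """Sketch of Eqs. (3)-(4): an L-layer convolution stack with layer
    normalization and ReLU, followed by mean pooling over the feature map.
    Kernel size, channel width and the use of Conv1d are assumptions."""
    def __init__(self, dim, num_layers=3, kernel_size=3):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
             for _ in range(num_layers)])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])

    def forward(self, x_attn):
        # x_attn: (k, dim) attention-weighted embeddings from Eq. (2)
        x = x_attn.t().unsqueeze(0)                       # (1, dim, k) for Conv1d
        for conv, norm in zip(self.convs, self.norms):
            x = conv(x)                                   # W^{l-1} * X^{l-1} (+ bias)
            x = torch.relu(norm(x.transpose(1, 2)).transpose(1, 2))  # LN + ReLU
        return x.mean(dim=-1).squeeze(0)                  # Eq. (4): x of shape (dim,)
```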

Finally, to enhance the representation of \(\textbf{x}\), we learn its probability distribution by mapping \(\textbf{x}\) to a Gaussian distribution \(p(\textbf{z}|\textbf{x})=\mathcal {N}(\varvec{\mu },\varvec{\sigma }^{2})\), where the mean \(\varvec{\mu }\) and standard deviation \(\varvec{\sigma }\) constitute the output of the multilayer perceptron (MLP), defined as:

$$\begin{aligned} \varvec{\mu }&=\textbf{W}_{\mu }\textbf{x}+\textbf{b}_{\mu } \end{aligned}$$
(5)
$$\begin{aligned} \varvec{\sigma }&=\textbf{W}_{\sigma }\textbf{x}+\textbf{b}_{\sigma } \end{aligned}$$
(6)

where {\(\textbf{W}_{\mu },\textbf{W}_{\sigma }\)} and {\(\textbf{b}_{\mu },\textbf{b}_{\sigma }\)} denote the weights and biases, respectively.

Note that sampling from a distribution is not differentiable. To address this, we use the following reparameterization strategy to sample \(\textbf{z}\) as the final representation of the property feature:

$$\begin{aligned} \textbf{z}=\varvec{\mu }+ \varvec{\epsilon } \odot \varvec{\sigma } \end{aligned}$$
(7)

where \(\varvec{\epsilon } \sim \mathcal {N}(0,\textbf{I})\), and \(\odot\) denotes the element-wise product.

To constrain the property feature space, we assume that the prior of \(\textbf{z}\) follows a standard normal distribution \(q(\textbf{z})=\mathcal {N}(0,\textbf{I})\), since, by the central limit theorem [32], the distribution of a sample mean approaches a normal distribution as the sample size grows. We then make the posterior \(p(\textbf{z}|\textbf{x})\) approximate the prior \(q(\textbf{z})\) by penalizing the Kullback–Leibler (KL) divergence between \(p(\textbf{z}|\textbf{x})\) and \(q(\textbf{z})\), computed as follows:

$$\begin{aligned} \mathcal {L}_{kl}&=KL(p(\textbf{z}|\textbf{x})||q(\textbf{z}))\\&=KL(\mathcal {N}(\varvec{\mu },\varvec{\sigma }^2)||\mathcal {N}(0,\textbf{I}))\\&=\int \frac{1}{\sqrt{2 \pi \varvec{\sigma }^{2}}}e^{-\frac{(\textbf{x}-\varvec{\mu })^{2}}{2 \varvec{\sigma }^{2}}}\log \frac{e^{-(\textbf{x}-\varvec{\mu })^{2}/2 \varvec{\sigma }^{2}}/\sqrt{2 \pi \varvec{\sigma }^{2}}}{e^{-\textbf{x}^{2}/2}/\sqrt{2\pi }}\,d\textbf{x}\\&=\int \frac{1}{\sqrt{2\pi \varvec{\sigma }^{2}}}e^{-\frac{(\textbf{x}-\varvec{\mu })^{2}}{2 \varvec{\sigma }^{2}}} \log \left\{ \frac{1}{\sqrt{\varvec{\sigma }^{2}}}\, e^{\frac{1}{2}\left[\textbf{x}^{2}-(\textbf{x}-\varvec{\mu })^{2}/\varvec{\sigma }^{2}\right]}\right\} d\textbf{x}\\&=\frac{1}{2}\int \frac{1}{\sqrt{2\pi \varvec{\sigma }^{2}}}e^{-\frac{(\textbf{x}-\varvec{\mu })^{2}}{2 \varvec{\sigma }^{2}}}\left[-\log \varvec{\sigma }^{2}+\textbf{x}^{2}-\frac{(\textbf{x}-\varvec{\mu })^{2}}{\varvec{\sigma }^{2}}\right]d\textbf{x}\\&=\frac{1}{2}\left(\varvec{\mu }^{2}+\varvec{\sigma }^{2}-\log \varvec{\sigma }^{2}-1\right) \end{aligned}$$
(8)
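A compact sketch of the Gaussian head in Eqs. (5)–(8) is given below. Predicting \(\varvec{\sigma }\) directly with a linear layer mirrors Eq. (6), although many implementations predict \(\log \varvec{\sigma }^{2}\) for numerical stability; the small constant inside the logarithm is our own safeguard, and all names are illustrative.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Sketch of Eqs. (5)-(8): produce mu and sigma, sample z with the
    reparameterization trick, and return the closed-form KL divergence
    to N(0, I)."""
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Linear(dim, dim)      # W_mu, b_mu in Eq. (5)
        self.sigma = nn.Linear(dim, dim)   # W_sigma, b_sigma in Eq. (6)

    def forward(self, x):
        mu, sigma = self.mu(x), self.sigma(x)
        eps = torch.randn_like(sigma)                      # eps ~ N(0, I)
        z = mu + eps * sigma                               # Eq. (7)
        # Eq. (8); the small constant is our own safeguard against log(0)
        kl = 0.5 * (mu ** 2 + sigma ** 2 - torch.log(sigma ** 2 + 1e-8) - 1).sum()
        return z, kl
```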

3.2.2 Matching Processor

Given a few-shot relation prediction task \(\mathcal {T}_{r}=\{\mathcal {R}_{r},\mathcal {Q}_{r}\}\), we first separate the reference \(\mathcal {R}_{r}\) into two parts, the set of head entities, and the set of tail entities.

We randomly initialize the matrices \(\textbf{H}\) and \(\textbf{T}\) to denote the embeddings of the head and tail entities, respectively. We feed \(\textbf{H}\) and \(\textbf{T}\) into the feature encoder to obtain the head property feature \(\textbf{z}_{h}\) and the tail property feature \(\textbf{z}_{t}\) by Eq. (7), together with the KL losses \(\mathcal {L}_{kl}^{h}\) and \(\mathcal {L}_{kl}^{t}\) of the head and tail property features by Eq. (8). We also obtain the head feature map \({\textbf {H}}^{L}\) of \(\textbf{H}\) and the tail feature map \({\textbf {T}}^{L}\) of \(\textbf{T}\) by Eq. (3). Next, we concatenate \(\textbf{H}\), \({\textbf {H}}^{L}\), \(\textbf{T}\) and \({\textbf {T}}^{L}\) as the input of the feature encoder to obtain the relation property feature \(\textbf{z}_{r}\) by Eq. (7) and its KL loss \(\mathcal {L}_{kl}^{r}\) by Eq. (8).

Note that there exist latent correlations between the reference \(\mathcal {R}_{r}\) and the query \(\mathcal {Q}_{r}\) within the same few-shot relation prediction task \(\mathcal {T}_{r}\). To incorporate the correlations between \(h_{q}\) and \(t_{q}\) into \(\textbf{q}_{h_{q},t_{q}}\), we build an MLP with two linear transformations and the following activation function:

$$\begin{aligned} \textbf{q}_{h_{q},t_{q}}=\textbf{W}_2 \cdot \textrm{ReLU}(\textbf{W}_1 \cdot (\textbf{h}_q \oplus \textbf{t}_q)+\textbf{b}_1)+\textbf{b}_2 \end{aligned}$$
(9)

where \(\textbf{h}_{q}\) and \(\textbf{t}_{q}\) denote the embeddings of \(h_{q}\) and \(t_{q}\), respectively; {\(\textbf{W}_{1}\), \(\textbf{W}_{2}\)} and {\(\textbf{b}_{1}\), \(\textbf{b}_{2}\)} denote the weights and biases, respectively; \(\oplus\) and \(\textrm{ReLU}(\cdot )\) denote the concatenation and activation function, respectively.

Similarly, we incorporate the correlations between \(\textbf{z}_{h}\) and \(\textbf{z}_{t}\) into \(\textbf{z}_{h,t}\) by Eq. (9).
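A minimal sketch of the two-layer MLP in Eq. (9) is shown below; as noted above, the same module can be reused for the pair \((\textbf{z}_{h}, \textbf{z}_{t})\) to produce \(\textbf{z}_{h,t}\). Dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class PairEncoder(nn.Module):
    """Sketch of Eq. (9): a two-layer MLP over the concatenation of two
    vectors, used for (h_q, t_q) and reused for (z_h, z_t)."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(2 * dim, dim)  # W_1, b_1
        self.fc2 = nn.Linear(dim, dim)      # W_2, b_2

    def forward(self, a, b):
        return self.fc2(torch.relu(self.fc1(torch.cat([a, b], dim=-1))))
```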

To incorporate \(\textbf{z}_{h,t}\) into \(\textbf{q}_{h_{q},t_{q}}\), we build an embedding network by applying a linear transformation with the following activation function to \(\textbf{z}_{h,t}\) and \(\textbf{q}_{h_{q},t_{q}}\):

$$\begin{aligned} \hat{\textbf{z}}_{r}=\textbf{W}_{o} \cdot \tanh (\textbf{W}_{h}\textbf{z}_{h,t} + \textbf{W}_{i}\textbf{q}_{h_{q},t_{q}}) \end{aligned}$$
(10)

where \(\textbf{z}_{h,t}\) denotes the representation of \((z_{h},z_{t})\), \(\textbf{q}_{h_{q},t_{q}}\) denotes the representation of \((h_{q},t_{q})\), {\(\textbf{W}_{o}\), \(\textbf{W}_{h}\), \(\textbf{W}_{i}\)} denotes the weights, and \(\tanh (\cdot )\) denotes the activation function.

Next, to match \((h_{q},t_{q})\) with \(\mathcal {R}_{r}\), the matching processor casts the matching problem as Euclidean distance-based clustering, since the independent few-shot relation prediction tasks can be viewed as clusters. To this end, we treat each few-shot relation prediction task as a cluster and take \(\textbf{z}_{r}\) as the cluster center. Based on \(\textbf{z}_{r}\) and \(\hat{\textbf{z}}_{r}\), we use the following Euclidean distance to measure how far \(\hat{\textbf{z}}_{r}\) is from \(\textbf{z}_{r}\):

$$\begin{aligned} f_{r}(h_{q},t_{q})=\Vert \hat{\textbf{z}}_{r}-\textbf{z}_{r} \Vert ^{2}_{2} \end{aligned}$$
(11)

where \(\Vert \cdot \Vert ^{2}_{2}\) denotes the squared \(L_{2}\) norm.

The smaller \(f_{r}(h_{q},t_{q})\) is, the more likely \(\hat{\textbf{z}}_{r}\) belongs to the cluster of \(\textbf{z}_{r}\), that is, the more likely \((h_{q},t_{q})\) holds w.r.t. the relation r.
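The embedding network of Eq. (10) and the distance of Eq. (11) can be sketched together as follows; whether the projections carry biases is not specified above, so the bias-free form here is an assumption.

```python
import torch
import torch.nn as nn

class MatchingScore(nn.Module):
    """Sketch of Eqs. (10)-(11): fuse z_{h,t} with the query representation
    q_{h_q,t_q}, then score the query by its squared Euclidean distance to
    the relation property feature z_r (smaller is better)."""
    def __init__(self, dim):
        super().__init__()
        self.w_h = nn.Linear(dim, dim, bias=False)  # W_h
        self.w_i = nn.Linear(dim, dim, bias=False)  # W_i
        self.w_o = nn.Linear(dim, dim, bias=False)  # W_o

    def forward(self, z_ht, q, z_r):
        z_hat = self.w_o(torch.tanh(self.w_h(z_ht) + self.w_i(q)))  # Eq. (10)
        return ((z_hat - z_r) ** 2).sum(dim=-1)                     # Eq. (11)
```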

3.3 Training Algorithm

For the relation r, we randomly sample k triples as the reference \(\mathcal {R}_{r}=\{(h_{i},r,t_{i})|(h_{i},r,t_{i}) \in \mathcal {G} \}_{i=1}^{k}\). The remaining triples \(\mathcal {Q}_{r}=\{(h_{q},r,t_{q})|(h_{q},r,t_{q}) \in \mathcal {G}, (h_{q},r,t_{q}) \notin \mathcal {R}_{r}\}\) are regarded as positive triples. Moreover, we construct a set of negative triples \(\mathcal {N}_{r}=\{(h_{q},r,t^{-}_{q})|(h_{q},r,t^{-}_{q}) \notin \mathcal {G}\}\) by corrupting the tail entities.

To distinguish positive triples from negative ones, and to ensure that the matching score (Eq. (11)) between a positive triple and \(\mathcal {R}_{r}\) is at least \(\gamma\) lower than that between the corresponding negative triple and \(\mathcal {R}_{r}\), we minimize the following hinge loss [33] over \(\mathcal {Q}_{r}\) and \(\mathcal {N}_{r}\):

$$\begin{aligned} \mathcal {L}_{q}=\sum \limits _{r}\sum \limits _{(h_{q},t_{q})\in \mathcal {Q}_{r}}\sum \limits _{(h_{q},t^{-}_{q})\in \mathcal {N}_{r}}[{f_{r}(h_{q},t_{q})+\gamma - f_{r}(h_{q},t^{-}_{q})}]_{+} \end{aligned}$$
(12)

where \([x]_{+}=\max (0,x)\), \(\gamma\) denotes the margin, \(f_{r}(h_{q},t_{q})\) denotes the matching score between \((h_q,r,t_q)\) and \(\mathcal {R}_{r}\), and \(f_{r}(h_{q},t^{-}_{q})\) denotes the matching score between \((h_q,r,t_{q}^{-})\) and \(\mathcal {R}_{r}\).
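A one-line sketch of Eq. (12) is given below, assuming the positive and negative scores have already been aligned pairwise; the default margin value is only a placeholder.

```python
import torch

def hinge_loss(pos_scores, neg_scores, gamma=1.0):
    """Sketch of Eq. (12): pos_scores and neg_scores hold f_r(h_q, t_q) and
    f_r(h_q, t_q^-) for pairwise-aligned positive/negative queries; gamma is
    the margin (the default value here is a placeholder)."""
    return torch.relu(pos_scores + gamma - neg_scores).sum()
```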

Meanwhile, we minimize the KL losses \(\mathcal {L}_{kl}^{h}\), \(\mathcal {L}_{kl}^{t}\) and \(\mathcal {L}_{kl}^{r}\) of the head, tail and relation property features obtained by Eq. (8) to constrain the space of the property features, since the smaller the KL loss, the closer the probability distribution of the property features is to the standard normal distribution. Overall, the loss function of our CARP model is given as follows:

$$\begin{aligned} \mathcal {L}=\mathcal {L}_{q}+ \mathcal {L}_{kl}^{h}+\mathcal {L}_{kl}^{t}+\mathcal {L}_{kl}^{r} \end{aligned}$$
(13)

The above procedure is summarized in Algorithm 1, whose time complexity is \(O(n \times |\mathcal {T}_{mtr}| \times |\mathbf {\Theta }|)\).

figure a
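Since Algorithm 1 appears here only as a figure, the following hedged sketch shows how one optimization step could combine Eqs. (12)–(13); the helper signature and the way scores are batched are our own assumptions, not the paper's interface.

```python
import torch

def train_step(pos_scores, neg_scores, kl_losses, optimizer, gamma=1.0):
    """Hedged sketch of one optimization step: pos_scores/neg_scores are the
    distances f_r for positive and negative query pairs of the sampled tasks,
    kl_losses is [L_kl^h, L_kl^t, L_kl^r]; the interface is illustrative."""
    loss = torch.relu(pos_scores + gamma - neg_scores).sum() + sum(kl_losses)  # Eq. (13)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```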

4 Experiments

In this section, we present experimental results on three real-life datasets to evaluate our CARP method. We first introduce the experimental settings, and then conduct four sets of experiments: (1) MRR/Hits@1/Hits@5/Hits@10 on 3/5-shot relation prediction, (2) impacts of few-shot size, (3) ablation study, and (4) case study to evaluate our method compared with existing methods.

4.1 Experiment Settings

Datasets and Evaluation Metrics. Our experiments were conducted on three KG datasets, NELL-One, FB-One, and Wiki-One, where NELL-One and Wiki-One were constructed by Xiong et al. [9]. NELL-One is based on NELL, a system that continuously collects structured knowledge from the web via an intelligent agent. Wiki-One is based on Wikidata, a free general-purpose structured knowledge base of encyclopedic knowledge. Furthermore, we followed a similar process to build another dataset from Freebase, a large collaborative knowledge base of social knowledge. Specifically, we first removed the inverse relations and then selected the relations with more than 50 but fewer than 500 triples for few-shot relation prediction. Each few-shot relation prediction task consists of the triples of the same relation. There are 67, 131 and 183 few-shot relation prediction tasks on NELL-One, FB-One and Wiki-One, respectively. Following the original settings [9], we split the few-shot relation prediction tasks into training/validation/testing sets of 51/5/11, 98/11/22, and 133/16/34 tasks on NELL-One, FB-One, and Wiki-One, respectively. The statistics of the datasets are shown in Table 2.

To evaluate the accuracy of our CARP model, we used two common ranking metrics: (1) Mean Reciprocal Rank (MRR), the average of the reciprocal ranks of the correct triples; (2) Hits@k (k = 1, 5, 10), the proportion of correct triples ranked in the top k. The higher the values of MRR and Hits@k, the better the relation prediction performance.
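For concreteness, a small sketch of how MRR and Hits@k can be computed from the ranks of the correct triples is given below; the function name and interface are illustrative.

```python
import torch

def mrr_hits(ranks, ks=(1, 5, 10)):
    """Sketch of the evaluation metrics: `ranks` holds the rank of the correct
    tail entity for each test query (1 is best)."""
    ranks = torch.as_tensor(ranks, dtype=torch.float)
    mrr = (1.0 / ranks).mean().item()
    hits = {k: (ranks <= k).float().mean().item() for k in ks}
    return mrr, hits
```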

Table 2 Statistics of datasets

Comparison Methods. We compared our proposed method with two categories of methods:

  • The embedding-based methods include TransE [19], TransH [20], RESCAL [23], ComplEx [24], GraIL [25] and TACT [26].

  • The few-shot relation prediction methods include GMatching [9], FAAN [15], FSRL [14], MetaR [28] and GANA [16].

Implementation. To implement few-shot relation prediction with the KGE methods, we used all the triples in the training set, together with the reference triples of the validation and testing sets, as training triples. For TransE/TransH/ComplEx/RESCAL, we used the open-source code released by [34]. For GraIL/TACT/GMatching/FAAN/FSRL/MetaR/GANA, we used the code released by their authors. For a fair comparison, the embedding dimension was set to 100, 100, and 50 for NELL-One, FB-One, and Wiki-One, respectively, following [9]. During the training of CARP, we used Adam [35] with a learning rate of 0.001 to update the parameters. In all experiments, the batch size was set to 64, and the number of training epochs was set to 200 for NELL-One, 300 for FB-One, and 400 for Wiki-One.

4.2 Experimental Results

4.2.1 Exp-1: MRR/Hits@1/Hits@5/Hits@10 on 3/5-shot Relation Prediction.

MRR/Hits@1/Hits@5/Hits@10 of 3/5-shot relation prediction on NELL-One/FB-One/Wiki-One are shown in Table 3, which tell us that:

  • On NELL-One, for the 3-shot relation prediction, our model improves MRR, Hits@1, Hits@5, and Hits@10 by 131.19%, 178.52%, 99.63%, and 47.25% over the second-highest model, respectively. For the 5-shot relation prediction, our model improves MRR, Hits@1, Hits@5, and Hits@10 by 106.52%, 154.49%, 83.90%, and 44.41% over the second-highest model, respectively.

  • On FB-One, for the 3-shot relation prediction, our model improves MRR, Hits@1, Hits@5, and Hits@10 by 60.86%, 71.84%, 40.54%, and 24.33% over the second-highest model, respectively. For the 5-shot relation prediction, our model improves MRR, Hits@1, Hits@5, and Hits@10 by 72%, 87.32%, 49.9%, and 31.34% over the second-highest model, respectively.

  • On Wiki-One, for the 3-shot relation prediction, our model improves MRR, Hits@1, Hits@5, and Hits@10 by 90.30%, 99.56%, 80.43%, and 69.65% over the second-highest model, respectively. For the 5-shot relation prediction, our model improves MRR, Hits@1, Hits@5, and Hits@10 by 81.49%, 97.4%, 80.37%, and 68.79% over the second-highest model, respectively.

In summary, our CARP model improves MRR, Hits@1, Hits@5, and Hits@10 by 90%, 124%, 70%, and 48%, respectively, on average over the second-highest comparison model on the three real-world datasets. This demonstrates that: (1) Our CARP model adapts well to different datasets, while the comparison methods perform unstably across datasets. For example, FSRL performs better on FB-One but worse on Wiki-One. (2) Our CARP model learns more useful representations of the entities by mining the property features rather than relying on background information in few-shot scenarios.

Table 3 MRR/Hits@1/Hits@5/Hits@10 on 3/5-shot relation prediction (Bold numbers denote the best results)

4.2.2 Exp-2: Impacts of Few-shot Size

To evaluate the impacts of few-shot size k, we set \(k=1,3,5,7\), and tested the MRR with different k. The results are reported in Fig. 2, which tell us that:

  • Our CARP model outperforms the comparison models for all values of k on NELL-One/FB-One/Wiki-One, demonstrating the effectiveness of our model for few-shot relation prediction.

  • MRR increases slightly with the increase in k, indicating that the larger the reference, the richer the information learned by our CARP model.

Fig. 2
figure 2

Impacts of few-shot size k

4.2.3 Exp-3: Ablation Study

To evaluate the contributions of the feature encoder and matching processor, we conducted ablation studies with two settings. Firstly, to test the effectiveness of the feature encoder, we replaced the feature encoder module with a mean-pooling layer over the reference, denoted as AS_1. Secondly, to test how much the property feature learned with the feature encoder contributes to the query, we replaced the property feature with the random feature as the input of the embedding network, denoted as AS_2. The results are reported in Table 4, which tell us that:

  • Our CARP model outperforms the variant AS_1, indicating that the feature encoder of our model could learn more effective and representative features from the given reference. Specifically, for the 3-shot relation prediction, our model improves MRR, Hits@1, Hits@5 and Hits@10 by 17.34%, 16.57%, 18.97% and 19.38% over AS_1 on NELL-One, 9.83%, 9.93%, 9.74% and 10.94% over AS_1 on FB-One, 21.14%, 23.64%, 18.81% and 19.31% over AS_1 on Wiki-One. For the 5-shot relation prediction, our model improves MRR, Hits@1, Hits@5 and Hits@10 by 40.95%, 41.20%, 41.69% and 42.89% over AS_1 on NELL-One, 30.55%, 34.85%, 26.35% and 28.40% over AS_1 on FB-One, 20.28%, 23.24%, 17.21% and 17.51% over AS_1 on Wiki-One.

  • Our CARP model outperforms the variant AS_2, demonstrating that the feature encoder contributes largely to the query. Specifically, for the 3-shot relation prediction, our model improves MRR, Hits@1, Hits@5 and Hits@10 by 74.25%, 81.22%, 68.14% and 67.50% over AS_2 on NELL-One, 129.14%, 153.39%, 114.60% and 118.75% over AS_2 on FB-One, 70.57%, 76.36%, 65.06% and 70.57% over AS_2 on Wiki-One. For the 5-shot relation prediction, our model improves MRR, Hits@1, Hits@5 and Hits@10 by 93.09%, 98.60%, 89.08% and 89.86% over AS_2 on NELL-One, 169.80%, 198.17%, 149.66% and 153.74% over AS_2 on FB-One, 70.57%, 89.21%, 78.15% and 78.05% over AS_2 on Wiki-One.

In summary, we can see that both the feature encoder and matching processor play crucial roles in our CARP model. This also verifies our assumption that the property features learned from the few observed triples play a crucial role in few-shot relation prediction.

Table 4 MRR/Hits@1/Hits@5/Hits@10 of model variants for 3/5-shot relation prediction (Bold numbers denote the best results)

4.2.4 Exp-4: Case Study

We conducted case studies to evaluate the MRR of each few-shot relation prediction task on NELL-One/FB-One/Wiki-One. The results are reported in Fig. 3, which tell us that:

  • Our CARP model has low variance on NELL-One/FB-One/Wiki-One, while the comparison methods have high variance, demonstrating the stability of our method under different few-shot relation prediction tasks.

  • Our CARP model achieves the best MRR in 79% of different few-shot relation prediction tasks, suggesting that our method is robust for different few-shot relation prediction tasks.

Fig. 3
figure 3

MRR on different relations

5 Conclusion and Future Work

In this paper, we have proposed the CARP model to predict new facts with only a few observed triples. By focusing on learning relation property features from the few observed triples rather than introducing background information, CARP avoids introducing noise. It not only enhances the representation of relations, but also facilitates predicting new facts in few-shot scenarios.

In the future, we will consider learning more valuable features about relations in few-shot scenarios. Besides, we will consider shuffling the order of the triples in the reference as a data augmentation strategy to enhance the representations of entities and relations.