Modeling Large-Scale Structured Relationships with Shared Memory for Knowledge Base Completion

Recent studies on knowledge base completion, the task of recovering missing relationships based on recorded relations, demonstrate the importance of learning embeddings from multi-step relations. However, due to the size of knowledge bases, learning multi-step relations directly on top of observed triplets can be costly. Hence, a manually designed procedure is often used when training such models. In this paper, we propose Implicit ReasoNets (IRNs), which are designed to perform multi-step inference implicitly through a controller and a shared memory. Without a human-designed inference procedure, IRNs use training data to learn to perform multi-step inference in an embedding neural space through the shared memory and the controller. While the inference procedure does not explicitly operate on top of observed triplets, our proposed model outperforms all previous approaches on the popular FB15k benchmark by more than 5.7%.

Knowledge bases are far from complete; for instance, many PERSON entities have no recorded nationality in a recent version of Freebase. We seek to infer such unknown entities based on the observed entities and relations. Thus, the knowledge base completion (KBC) task has emerged as an important open research problem (Nickel et al., 2011).
Neural-network-based methods have been very popular for solving the KBC task. Following Bordes et al. (2013), one of the most popular approaches for KBC is to learn vector-space representations of entities and relations during training, and then apply linear or bi-linear operations to infer the missing relations at test time. However, several recent papers demonstrate limitations of prior approaches that rely on vector-space models alone (Guu et al., 2015; Toutanova et al., 2016; Lin et al., 2015a). By themselves, these models have no straightforward way to adequately capture the structured relationships among multiple triplets. For example, suppose we want to fill in the missing relation for the triplet (Obama, NATIONALITY, ?); a multi-step search procedure might be needed to discover the evidence in observed triplets such as (Obama, BORNIN, Hawaii) and (Hawaii, PARTOF, U.S.A). To address this issue, Guu et al. (2015), Toutanova et al. (2016), and Lin et al. (2015a) propose different approaches for injecting structured information based on human-designed inference procedures (e.g., random walks) that directly operate on the observed triplets. Unfortunately, due to the size of knowledge bases, these approaches suffer from some limitations: most paths are not informative for inferring missing relations, and it is prohibitive to consider all possible paths during training.
In this paper, we propose Implicit ReasoNets (IRNs), which take a different approach from prior work on KBC by addressing the challenge of multi-step inference through the design of a controller and a shared memory. We design a shared memory component to store KB information implicitly; that is, the model itself determines what information it should store. Moreover, instead of explicitly manipulating the observed triplets according to a human-designed inference procedure, the proposed model learns the multi-step inference procedure implicitly, i.e., without human intervention. Specifically, our model makes the prediction several times while forming different intermediate representations along the way. The controller determines how many steps the model should take for a given input. At each step, a new representation is formed by taking the current representation and a context vector generated by accessing the shared memory. The detailed process is introduced in Section 3.3, and an overview of the model is shown in Figure 1.
The main contributions of our paper are as follows: • We propose Implicit ReasoNets (IRNs), which use a shared memory guided by a controller to model multi-step structured relationships implicitly.
• We evaluate IRNs and demonstrate that our proposed model achieves state-of-the-art results on the popular FB15k benchmark, surpassing prior approaches by more than 5.7%.
• Our analysis shows that multi-step inference is crucial to the performance of our model.

Knowledge Base Completion Task
The goal of the Knowledge Base Completion (KBC) task is to predict a missing head or tail entity given the relation type and the other entity, i.e., predicting the head entity h given a triplet (?, R, t) with relation R and tail entity t, or predicting the tail entity t given a triplet (h, R, ?) with head entity h and relation R, where ? denotes the missing entity.
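To make the two query directions concrete, here is a minimal illustration of the data format (the triplets below are illustrative examples, not taken from an actual benchmark):

```python
# Observed triplets are (head, relation, tail) tuples.
observed = [
    ("Obama", "BORNIN", "Hawaii"),
    ("Hawaii", "PARTOF", "U.S.A"),
]

# Tail prediction: given (h, R, ?), rank every entity as a candidate tail.
tail_query = ("Obama", "NATIONALITY", "?")

# Head prediction: given (?, R, t), rank every entity as a candidate head.
head_query = ("?", "NATIONALITY", "U.S.A")
```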
Early work on KBC focuses on learning symbolic rules. Schoenmackers et al. (2010) learn inference rules from sequences of triplets, e.g., (X, COUNTRYOFHEADQUARTERS, Y) is implied by (X, ISBASEDIN, A), (A, STATELOCATEDIN, B), and (B, COUNTRYLOCATEDIN, Y). However, enumerating all possible rules is intractable when the knowledge base is large, since the number of distinct sequences of triplets grows rapidly with the number of relation types. Also, rule-based methods cannot generalize to paraphrase alternations.
Recent approaches (Bordes et al., 2013; Socher et al., 2013) achieve better generalization by operating on embedding representations, where vector similarity can be regarded as semantic similarity.
During evaluation, models compute the similarity between the output prediction and all entities. The mean rank and the precision of the target entity (e.g., hits@10) are used as evaluation metrics.

Proposed Model
Our proposed model uses the same setup as embedding-based approaches (Bordes et al., 2013; Socher et al., 2013): the model first takes a triplet with a missing entity, (h, R, ?), as input, then maps the input into the neural space through embeddings, and finally outputs a prediction vector for the missing entity. Given that our model is a neural model, we use an encoder module to transform the input triplet (h, R, ?) into a continuous representation. For generating prediction results, the decoder module takes the generated continuous representation and outputs a prediction vector, which can be used to find the nearest entity embedding. In short, we use the encoder and decoder modules to convert between the symbolic space and the neural space.
The main difference between our model and previously proposed models is that we make the prediction several times while forming multiple intermediate continuous representations along the way. Given an intermediate representation, the controller judges whether the representation encodes enough information to produce the output prediction. If it does, we produce the current prediction as our final output. Otherwise, the controller generates a new continuous representation by taking the current representation and a context vector generated by accessing the shared memory. The new representation is then fed back into the controller, and the whole process is repeated until the controller stops it.
Note that the number of steps varies according to the complexity of each example.

Inference
Encoder/Decoder Given an input (h, R, ?), the encoder module retrieves the embeddings of the entity h and the relation R from an embedding matrix, and then concatenates the two vectors to form the initial representation s_1.
The decoder module outputs a prediction vector f_o(s_t) = tanh(W_o s_t + b_o), a nonlinear projection of the intermediate representation s_t (the controller hidden state), where W_o and b_o are the weight matrix and bias vector, respectively. W_o is a k-by-n matrix, where k is the number of possible entities and n is the dimension of the hidden vector s_t.
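As a minimal sketch, the two modules might look as follows in PyTorch; the class names and dimension arguments are our own illustrative choices, not the original implementation:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Map a query (h, R, ?) to the initial representation s_1 = [h; r]."""
    def __init__(self, n_entities, n_relations, ent_dim, rel_dim):
        super().__init__()
        self.ent_emb = nn.Embedding(n_entities, ent_dim)
        self.rel_emb = nn.Embedding(n_relations, rel_dim)

    def forward(self, head_idx, rel_idx):
        # Concatenate the entity and relation embeddings.
        return torch.cat([self.ent_emb(head_idx), self.rel_emb(rel_idx)], dim=-1)

class Decoder(nn.Module):
    """Prediction vector o_t = f_o(s_t) = tanh(W_o s_t + b_o)."""
    def __init__(self, state_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(state_dim, out_dim)  # W_o, b_o

    def forward(self, s_t):
        return torch.tanh(self.proj(s_t))
```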

Shared Memory
The shared memory is denoted as M = {m_i}_{i=1}^{|M|}, which consists of a list of vectors. The shared memory is shared across all training instances; during training, it is first randomly initialized and then jointly learned with the controller on the training data.
Controller The controller plays two roles in our model. First, it judges whether the process should stop; if so, the output is generated. Otherwise, it generates a new representation based on the previous representation and a context vector retrieved from the shared memory. The controller is a recurrent neural network that controls the process by keeping an internal state sequence to track the current search process and its history. The controller uses an attention mechanism to fetch information from relevant memory vectors in M, and decides whether the model should output the prediction or continue to update the input vector in the next step.
To judge whether the process should continue, the model estimates P(stop|s_t) with a logistic regression module: sigmoid(W_c s_t + b_c), where the weight matrix W_c and bias vector b_c are learned during training. With probability P(stop|s_t), the process stops and the decoder is called to generate the output.
With probability 1 − P(stop|s_t), the controller generates the next representation s_{t+1} = RNN(s_t, x_t). The attention vector x_t at step t is generated based on the current internal state s_t and the shared memory M. Specifically, the attention score a_{t,i} on a memory vector m_i given a state s_t is computed as a_{t,i} = softmax_i(λ cos(W_1 m_i, W_2 s_t)), where λ is set to 10 in our experiments and the weight matrices W_1 and W_2 are learned during training. The attention vector can then be written as x_t = Σ_i a_{t,i} m_i.

Overall Process The inference process is formally described in Algorithm 1. Given the input (Obama, NATIONALITY, ?), the encoder module converts it to a vector s_1 by concatenating the entity and relation embeddings. Then, at step t, with probability P(stop|s_t), the model outputs the prediction vector o_t. With probability 1 − P(stop|s_t), the state s_{t+1} is updated based on the previous state s_t and the vector x_t generated by attending over the shared memory.
We iterate this process until a predefined maximum number of steps T_max is reached. At test time, the model outputs the prediction o_j from the step j with the maximum termination probability. Note that the overall framework is generic: it can be applied to different applications by tailoring the encoder and decoder to the target application. An example of the shortest path synthesis task is given in Appendix B.
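A minimal sketch of the controller and the test-time inference loop of Algorithm 1, written against the encoder/decoder sketch above; λ = 10 and T_max = 5 follow the text, while the tensor shapes, the GRU cell, and the unbatched processing are our own simplifications (state_dim is assumed to equal the encoder output dimension):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Controller(nn.Module):
    """One reformulation step: attend over the shared memory, estimate P(stop|s_t),
    and produce the next state s_{t+1} = RNN(s_t, x_t)."""
    def __init__(self, state_dim, mem_dim, n_memory, att_dim, lam=10.0):
        super().__init__()
        # Shared memory M: randomly initialized (unit L2-norm rows), learned jointly.
        self.memory = nn.Parameter(F.normalize(torch.randn(n_memory, mem_dim), dim=-1))
        self.W1 = nn.Linear(mem_dim, att_dim, bias=False)    # projects memory vectors
        self.W2 = nn.Linear(state_dim, att_dim, bias=False)  # projects the state
        self.rnn = nn.GRUCell(mem_dim, state_dim)
        self.stop = nn.Linear(state_dim, 1)                  # logistic stop gate
        self.lam = lam

    def attend(self, s_t):
        # a_{t,i} = softmax_i(lam * cos(W1 m_i, W2 s_t));  x_t = sum_i a_{t,i} m_i
        m_proj = F.normalize(self.W1(self.memory), dim=-1)   # (n_memory, att_dim)
        s_proj = F.normalize(self.W2(s_t), dim=-1)           # (att_dim,)
        scores = self.lam * (m_proj @ s_proj)                # cosine similarities
        return torch.softmax(scores, dim=-1) @ self.memory   # x_t, shape (mem_dim,)

    def step(self, s_t):
        p_stop = torch.sigmoid(self.stop(s_t)).squeeze(-1)   # P(stop | s_t)
        s_next = self.rnn(self.attend(s_t).unsqueeze(0), s_t.unsqueeze(0)).squeeze(0)
        return p_stop, s_next

def infer(encoder, controller, decoder, head_idx, rel_idx, t_max=5):
    """Test-time inference: run up to t_max steps and return the prediction o_j
    from the step j with the highest termination probability."""
    s_t = encoder(head_idx, rel_idx).squeeze(0)              # s_1 = [h; r]
    best_p, best_o = -1.0, None
    for _ in range(t_max):
        p_stop, s_next = controller.step(s_t)
        o_t = decoder(s_t.unsqueeze(0)).squeeze(0)
        if p_stop.item() > best_p:
            best_p, best_o = p_stop.item(), o_t
        s_t = s_next
    return best_o
```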

Training Objectives
In this section, we introduce the training objective of our model. While our process is stochastic, the model mainly needs to decide the number of steps used to generate the intermediate representations for each example. Since the number of steps the model should take for each example is unknown in the training data, we optimize the expected reward directly, motivated by the REINFORCE algorithm (Williams, 1992).
The expected reward at step t is obtained as follows. At step t, given the representation vector s_t, the model generates the output score o_t = f_o(s_t). We convert the output score into a probability: the probability of selecting a prediction ŷ ∈ D is approximated as p(ŷ|o_t) = exp(−γ d(o_t, ŷ)) / Σ_{y_k ∈ D} exp(−γ d(o_t, y_k)), where d(o, y) = ‖o − y‖_1 is the L_1 distance between the output o and the target entity y, and D is the set of all possible entities. In our experiments, we set γ to 5 and sample 20 negative examples in D to speed up training. Assuming the ground-truth target entity embedding is y*, the expected reward at step t is J_t = Σ_{ŷ ∈ D} p(ŷ|o_t) R(ŷ, y*), where R is the reward function; we assign a reward of 1 when the model makes a correct prediction of the target entity, and 0 otherwise. Next, we obtain the overall reward by summing over all steps. The probability that the model terminates at step t is (Π_{i=1}^{t−1} (1 − v_i)) v_t, where v_i = P(stop|s_i, θ). Therefore, the overall objective function can be written as J(θ) = Σ_{t=1}^{T_max} (Π_{i=1}^{t−1} (1 − v_i)) v_t Σ_{ŷ ∈ D} p(ŷ|o_t) R(ŷ, y*). The parameters are then updated through backpropagation.
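A sketch of how this objective could be assembled for a single training example, assuming the intermediate states and termination probabilities have already been collected; γ = 5 follows the text, while the rest (one query at a time, explicit negative samples, no variance reduction) is our own simplification:

```python
import torch

def expected_reward_loss(states, stop_probs, decoder, target_emb, neg_embs, gamma=5.0):
    """states: intermediate representations s_1..s_T; stop_probs: v_t = P(stop | s_t).
    target_emb: embedding of the ground-truth entity y*; neg_embs: sampled negatives.
    Returns the negative overall expected reward."""
    candidates = torch.cat([target_emb.unsqueeze(0), neg_embs], dim=0)  # target is row 0
    expected_reward = 0.0
    continue_prob = 1.0
    for s_t, v_t in zip(states, stop_probs):
        o_t = decoder(s_t.unsqueeze(0)).squeeze(0)
        # p(y_hat | o_t) proportional to exp(-gamma * ||o_t - y_hat||_1)
        dists = (candidates - o_t).abs().sum(dim=-1)
        probs = torch.softmax(-gamma * dists, dim=0)
        # R(y_hat, y*) is 1 only for the target entity, so the expected reward is probs[0].
        termination_prob = continue_prob * v_t      # prod_{i<t} (1 - v_i) * v_t
        expected_reward = expected_reward + termination_prob * probs[0]
        continue_prob = continue_prob * (1.0 - v_t)
    return -expected_reward  # minimize the negative expected reward
```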

Motivating Examples
We now describe motivating examples to explain the design of the shared memory, which implicitly stores KB information, and the design of the controller, which implicitly learns the inference procedure.
Shared Memory Suppose, in a KBC task, the input is (Obama, NATIONALITY, ?) and the model is required to predict the missing entity (answer: U.S.A). Our model can learn to utilize and store information in the shared memory through the controller. When new information from a new instance is received (e.g., (Obama, BORNIN, Hawaii)), the model first uses its controller to look for relevant information (e.g., (Hawaii, PARTOF, U.S.A)). If no relevant information is found, the model learns to store the information in the memory vectors via gradient updates in order to answer the missing entity correctly. Due to the limited size of the shared memory, the model cannot store all new information explicitly. Thus, it needs to learn to use the shared memory efficiently to lower the training loss. If related information from a new instance is received, the model learns to perform inference by using the controller to go over existing memory vectors iteratively. In this way, the model can learn to perform inference and correlate training instances via the memory cells without explicitly storing the new information.
Controller The design of the controller allows the model to iteratively reformulate its representation by incorporating context information retrieved from the shared memory. Without an explicitly provided, human-designed inference procedure, the controller needs to explore the multi-step inference procedure on its own during this iterative process. Suppose a given input triplet cannot be resolved in one step. The controller then needs to use its reformulation capability to explore different representations and make the correct prediction in order to lower the training loss.

Experimental Results
In this section, we evaluate the performance of our model on the benchmark FB15k and WN18 datasets for KBC (Bordes et al., 2013). These datasets contain multiple relations between head and tail entities. Given a head entity and a relation, the model produces a ranked list of entities according to the score of each entity being the tail entity of the triplet. To evaluate the ranking, we report the mean rank (MR), the mean rank of the correct entity across the test examples, and hits@10, the proportion of correct entities ranked in the top-10 predictions. A lower MR or a higher hits@10 indicates better prediction performance. We follow the evaluation protocol in Bordes et al. (2013) and report filtered results, where corrupted triplets that already appear in the knowledge base are removed from the candidate list; this avoids penalizing the model when other valid triplets are ranked above the target triplet.

We use the same hyper-parameters of our model for both FB15k and WN18. Entity embeddings (which are not shared between input and output modules) and relation embeddings are both 100-dimensional. We use the encoder module to encode input entities and relations, and the decoder module to produce output entities. There are 64 memory vectors with 200 dimensions each, initialized by random vectors with unit L_2-norm. We use a single-layer GRU with 200 cells as the search controller. We set the maximum inference step of the IRN to 5. We randomly initialize all model parameters and use SGD as the training algorithm with a mini-batch size of 64. We set the learning rate to a constant, 0.01. To prevent the model from learning a trivial solution by increasing entity embedding norms, we follow Bordes et al. (2013) and constrain the L_2-norm of the entity embeddings to 1. We use hits@10 as the validation metric for the IRN. Following Lin et al. (2015a), we add reverse relations to the training triplet set to increase the amount of training data.
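A sketch of the filtered MR and hits@10 computation; scoring entities by L1 distance to the prediction vector mirrors the training objective above, while the concrete function signatures are our own:

```python
import numpy as np

def filtered_rank(pred_vec, entity_embs, target_id, other_valid_ids):
    """Rank all entities by L1 distance to the prediction vector. In the filtered
    setting, entities that also form valid triplets with the query (other than the
    target itself) are removed from the candidate list before ranking."""
    dists = np.abs(entity_embs - pred_vec).sum(axis=1)
    dists[list(other_valid_ids - {target_id})] = np.inf   # filter other valid answers
    rank = np.argsort(dists).tolist().index(target_id) + 1  # 1-based rank of the target
    return rank

def mean_rank_and_hits(ranks, k=10):
    """Aggregate per-query ranks into mean rank (MR) and hits@k."""
    ranks = np.asarray(ranks, dtype=float)
    return ranks.mean(), float((ranks <= k).mean())
```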
Following Nguyen et al. (2016), we divide the results of previous work into two groups. The first group contains models that directly optimize a scoring function for the triplets in a knowledge base without using extra information. The second group of models makes use of additional information from multi-step relations. For example, RTransE (García-Durán et al., 2015) and PTransE (Lin et al., 2015a) are extensions of the TransE (Bordes et al., 2013) model that explicitly explore multi-step relations in the knowledge base to regularize the trained embeddings. The NLFeat model (Toutanova et al., 2015) is a log-linear model that makes use of simple node and link features.
We evaluate hits@10 results on FB15k with respect to the relation categories. Following the evaluation in Bordes et al. (2013), we categorize the relations according to the cardinalities of their associated head and tail entities into four types: 1-1, 1-Many, Many-1, and Many-Many. A given relation is 1-1 if a head entity can appear with at most one tail entity, 1-Many if a head entity can appear with many tail entities, Many-1 if multiple head entities can appear with the same tail entity, and Many-Many if multiple head entities can appear with multiple tail entities. The detailed results are shown in Table 3. The IRN significantly improves the hits@10 results in the Many-1 category when predicting the head entity (18.8%), the 1-Many category when predicting the tail entity (16.5%), and the Many-Many category (over 8% on average).
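A sketch of how the four categories can be derived from the training triplets; the 1.5 cutoff on the average number of heads/tails per entity is the convention from Bordes et al. (2013) and is assumed here rather than stated in the text:

```python
from collections import defaultdict

def relation_categories(triplets, threshold=1.5):
    """Classify each relation as 1-1, 1-Many, Many-1, or Many-Many from its average
    number of tails per head and heads per tail over the training triplets."""
    tails_per_head = defaultdict(lambda: defaultdict(set))
    heads_per_tail = defaultdict(lambda: defaultdict(set))
    for h, r, t in triplets:
        tails_per_head[r][h].add(t)
        heads_per_tail[r][t].add(h)

    categories = {}
    for r in tails_per_head:
        avg_tph = sum(len(ts) for ts in tails_per_head[r].values()) / len(tails_per_head[r])
        avg_hpt = sum(len(hs) for hs in heads_per_tail[r].values()) / len(heads_per_tail[r])
        head_side = "1" if avg_hpt < threshold else "Many"
        tail_side = "1" if avg_tph < threshold else "Many"
        categories[r] = f"{head_side}-{tail_side}"
    return categories
```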
In order to show the inference procedure learned by IRNs, we map the representation s_t back to human-interpretable entity and relation names in the KB. In Table 4, we show a randomly sampled example with its top-3 closest triplets (h, R, ?) in terms of L_2-distance, and the top-3 answer predictions along with the termination probability at each step. From our observations, the inference procedure is quite different from the traditional inference chains that people design in the symbolic space (Schoenmackers et al., 2010). The likely reason is that IRNs operate in the neural space. Instead of connecting triplets that share exactly the same entity, as in the symbolic space, IRNs update the representations and connect other triplets in the semantic space. As we can observe in the examples of Table 4, the model reformulates the representation s_t at each step and gradually increases the ranking score of the correct tail entity, with a higher termination probability, over the course of the inference process. In the last step of Table 4, the closest tuple (Phoenix Suns, /BASKETBALL_ROSTER_POSITION/POSITION) is actually within the training set with the tail entity Forward-center, which is the same as the target entity. Hence, the whole inference process can be thought of as the model iteratively reformulating the representations in order to minimize the distance to the target entity in the neural space.
Table 4: Interpretation of the state s_t at each step via the closest (entity, relation) tuple, along with the corresponding top-3 predictions and termination probability. "Rank" stands for the rank of the target entity and "Term. Prob." stands for the termination probability.

To understand what the model has learned in the shared memory in the KBC tasks, we visualize in Table 5 the shared memory of an IRN trained on FB15k. We compute the average attention score of each relation type on each memory cell. In the table, we show the top 8 relations, ranked by average attention score, for some memory cells. These memory cells are activated by certain semantic patterns within the knowledge graph, which suggests that the shared memory can efficiently capture these relationships implicitly. We can still see a few noisy relations in each clustered memory cell, e.g., the "bridge-player-teammates/teammate" relation in the "film" memory cell, and "olympic-medal-honor/medalist" in the "disease" memory cell.
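A sketch of this visualization procedure, assuming the attention vectors have been recorded during inference over a set of queries (the function and data layout are ours):

```python
from collections import defaultdict

def average_attention_by_relation(attention_records, top_k=8):
    """attention_records: list of (relation_name, attention_weights) pairs collected
    while running inference, where attention_weights has one score per memory cell.
    Returns, for each memory cell, the top-k relations by average attention score."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for rel, att in attention_records:
        counts[rel] += 1
        for cell, score in enumerate(att):
            sums[rel][cell] += float(score)

    n_cells = len(attention_records[0][1])
    top_relations = {}
    for cell in range(n_cells):
        avg = {rel: sums[rel][cell] / counts[rel] for rel in sums}
        top_relations[cell] = sorted(avg, key=avg.get, reverse=True)[:top_k]
    return top_relations
```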
We provide more IRN prediction examples, step by step, from FB15k in Appendix A. In addition to the KBC tasks, we construct a synthetic task, shortest path synthesis, to evaluate the inference capability over a shared memory; see Appendix B.

Related Work
Link Prediction and Knowledge Base Completion Given a relation R, a head entity h, and a tail entity t, most embedding models for link prediction focus on finding a scoring function f_r(h, t) that represents the implausibility of a triplet (Bordes et al., 2011, 2013, 2014; Wang et al., 2014; Ji et al., 2015; Nguyen et al., 2016). In many studies, the scoring function f_r(h, t) is linear or bi-linear. For example, in TransE (Bordes et al., 2013), the function is implemented as f_r(h, t) = ‖h + r − t‖, where h, r, and t are the corresponding vector representations.
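As a concrete illustration of such a scoring function, a minimal TransE-style implausibility score (lower means more plausible); the choice of L1 versus L2 norm is a model hyper-parameter:

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    """TransE implausibility score f_r(h, t) = ||h + r - t||, with h, r, t embedding vectors."""
    return np.linalg.norm(h + r - t, ord=norm)
```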
Recently, several studies (Guu et al., 2015; Lin et al., 2015a; Toutanova et al., 2016) demonstrate the importance of models also learning from multi-step relations. Learning from multi-step relations injects the structured relationships between triplets into the model. However, this also poses the technical challenge of considering an exponential number of multi-step relationships. Prior approaches address this issue by designing path-mining algorithms (Lin et al., 2015a) or by considering all possible paths using a dynamic programming algorithm, restricted to linear or bi-linear models (Toutanova et al., 2016). Toutanova and Chen (2015) show the effectiveness of using simple node and link features that encode structured information on FB15k and WN18. In our work, the IRN outperforms prior results, showing that similar information can be captured by the model without explicitly designing inference procedures on the observed triplets. Our model can be regarded as a recursive function that iteratively updates the representation so as to minimize its distance to the target entity in the neural space, i.e., ‖f_IRN(h, r) − t‖. Studies such as Riedel et al. (2013) show that incorporating textual information can further improve KBC. It would be interesting to incorporate information from outside the knowledge base into our model in the future.
Neural Frameworks Sequence-to-sequence models (Sutskever et al., 2014; Cho et al., 2014) have been shown to be successful in many applications, such as machine translation and conversation modeling (Sordoni et al., 2015). While sequence-to-sequence models are powerful, recent work has shown the necessity of incorporating an external memory to perform inference even in simple algorithmic tasks (Graves et al., 2014, 2016).
Comparing IRNs to Memory Networks (MemNN) (Weston et al., 2014; Sukhbaatar et al., 2015; Miller et al., 2016) and Neural Turing Machines (NTM) (Graves et al., 2014, 2016), the biggest difference between our model and these existing frameworks lies in the controller and the use of the shared memory. We follow Shen et al. (2016) in using a controller module to dynamically perform multi-step inference depending on the complexity of the instance. MemNN and NTM explicitly store inputs (such as graph definitions and supporting facts) in memory. In contrast, IRNs do not explicitly store all the observed inputs in the shared memory. Instead, we operate directly on the shared memory, which models the structured relationships implicitly. During training, we randomly initialize the memory and update it jointly with the controller with respect to task-specific objectives.

Conclusion
In this paper, we propose Implicit ReasoNets (IRNs), which perform inference over a shared memory that implicitly stores large-scale structured relationships. The inference process is guided by a controller that accesses the memory, which is shared across instances. We demonstrate and analyze the multi-step inference capability of IRNs on knowledge base completion tasks. Our model, without using any explicit knowledge base information in the inference procedure, outperforms all prior approaches on the popular FB15k benchmark by more than 5.7%.
For future work, we aim to extend IRNs in two ways. First, inspired by Ribeiro et al. (2016), we would like to develop techniques to generate human-understandable reasoning interpretations from the shared memory. Second, we plan to apply IRNs to infer relationships in unstructured data such as natural language. For example, given a natural language query such as "are rabbits animals?", the model could infer the answer implicitly via the shared memory, without performing inference directly on top of huge amounts of observed sentences such as "all mammals are animals" and "rabbits are mammals". We believe that the ability to perform inference implicitly is crucial for modeling large-scale structured relationships.

Figure 1 :
Figure 1: An overview of the IRN for KBC tasks.

Figure 2 :
Figure 2: A running example of the IRN architecture. Given the input (Obama, CITIZENSHIP, ?), the model iteratively reformulates the input vector via the current input vector and the attention vector over the shared memory, and determines to stop when an answer is found.
Algorithm 1 Inference Process of IRNs
  Lookup entity and relation embeddings h and r; set s_1 = [h, r]    ▷ Encoder
  while True do
    u ∼ Uniform[0, 1]
    if u > P(stop|s_t) then
      x_t = f_att(s_t, M)                                            ▷ Access memory
      s_{t+1} = RNN(s_t, x_t), t ← t + 1
    else
      Generate output o_t = f_o(s_t)                                 ▷ Decoder
      break                                                          ▷ Stop
    end if
  end while

Table 1 :
The knowledge base completion (link prediction) results on WN18 and FB15k. (*) Nguyen et al. (2016) reported two results on WN18: the first is obtained by optimizing hits@10 on the validation set, and the second by optimizing MR on the validation set; we list both.

Table 2 :
The performance of IRNs with different memory sizes and inference steps on FB15k, where |M | and T max represent the number of memory vectors and the maximum inference step, respectively.

Table 5 :
Shared memory visualization of an IRN trained on FB15k, where we show the top 8 relations, ranked by average attention scores, for some memory cells. The first row in each column represents the interpreted relation.