Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples

Most people do not interact with Semantic Web data directly. Unless they have the expertise to understand the underlying technology, they need textual or visual interfaces to help them make sense of it. We explore the problem of generating natural language summaries for Semantic Web data. This is non-trivial, especially in an open-domain context. To address this problem, we explore the use of neural networks. Our system encodes the information from a set of triples into a vector of fixed dimensionality and generates a textual summary by conditioning the output on the encoded vector. We train and evaluate our models on two corpora of loosely aligned Wikipedia snippets and DBpedia and Wikidata triples with promising results.


Introduction
While Semantic Web data, such as triples in the Resource Description Framework (RDF), is easily accessible by machines, it is difficult to understand for people who are unfamiliar with the underlying technology. For humans, by contrast, reading text is a much more accessible activity. In the context of the Semantic Web, Natural Language Generation (NLG) is concerned with the implementation of textual interfaces that make the information stored in knowledge base triples more accessible to humans. Further development of NLG systems could be beneficial in a wide range of application domains. Examples include Question Answering platforms, whose user experience could be improved by automatically generating a textual description of an entity returned for a user's query (e.g. the Google Knowledge Graph 1 and the Wikidata Reasonator 2 ), or dialogue systems in commercial environments, which could be enhanced to generate responses that better address users' questions [1].
So far, research has mostly focused on adapting rule-based approaches to generate text from Semantic Web data. These systems worked in domains with small vocabularies and restricted linguistic variability, such as football match summaries [2] and museum exhibits' descriptions [3]. However, the tedious repetition of their textual patterns, along with the difficulty of transferring the involved rules across different domains or languages, prevented them from becoming widely accepted [4].
We address the above limitations by proposing a statistical model for NLG using neural networks. Our work explores how an adaptation of the encoder-decoder framework [5,6] can be used to generate textual summaries for triples. More specifically, given a set of triples about an entity (i.e. the entity appears as the subject or the object of the triples), our task consists in summarising them in the form of comprehensible text. We propose a model that consists of a feed-forward architecture that encodes each triple from an input set into a vector of fixed dimensionality in a continuous semantic space, and an RNN-based decoder that generates the textual summary one word at a time. Our model jointly learns unique vector representations ("embeddings") for entities and words as they occur in the text, and for predicates and entities as they occur in the corresponding triples. In contrast with less flexible, rule-based strategies for NLG, our approach does not constrain the number of potential relations between the triples' predicates and the generated text. Consequently, a learnt predicate embedding, given its position in the semantic space, can be expressed in a variety of different ways in the text.
Training data for NLG models is not always readily available; this applies to the context of the Semantic Web as well. The difficulty is that data available in knowledge bases needs to be aligned with the corresponding texts. Existing solutions for data-to-text generation either focus on creating a small, domain-specific corpus where data and text are manually aligned by a small group of experts, such as the WeatherGov [7] and RoboCup [8] datasets, or rely heavily on crowdsourcing [9], which makes them costly to apply to large domains. Our second contribution is an automatic approach for building a large data-to-text corpus of rich linguistic variability. We rely on the alignment of DBpedia and Wikidata with Wikipedia in order to create two corpora of knowledge base triples from DBpedia and Wikidata, and their corresponding textual summaries. For the purpose of this paper, we chose to retrieve articles about people and their biographies [10]. We extracted two corpora with vocabularies of over 400k words that consist of: (i) 260k Wikipedia summaries aligned with a total of 2.7M DBpedia triples, and (ii) 360k Wikipedia summaries aligned with a total of 43M Wikidata triples.
Our proposed model learns to generate a textual summary as a sequence of words and entities. We experiment with two different approaches, one rule-based and one statistical, in order to infer the verbalisation of the predicted entities in the generated summary. Conventional neural network systems, when employed on NLG tasks such as Machine Translation [6] or Question Generation [11], are incapable of learning high-quality vector representations for the infrequent tokens (i.e. either words or entities) in their training dataset. Inspired by [12,11], we address this problem by adapting a multi-placeholder method that enables the model to emit special tokens that map a rare entity in the text to its corresponding triple in the input set. We use perplexity, and the BLEU and ROUGE metrics, in order to automatically evaluate our approach's ability to predict the Wikipedia summary that corresponds to a set of unknown triples, showing substantial improvement over our baselines. Furthermore, we evaluate a set of generated summaries against human evaluation criteria. Based on the average rating across our selected criteria, we conclude that our approach is able to generate coherent textual summaries that address most of the information that is encoded in the input triples. Lastly, we demonstrate our method's capability to infer semantic relationships among entities by computing the nearest neighbours of the learned embeddings of respective entities in our datasets.
The structure of the paper is as follows. Section 2 discusses existing approaches to NLG and the Semantic Web, and relates them to our model. Section 3 presents the components of our approach. Section 4 describes the construction of our datasets. Section 5 presents experiments and an evaluation of the model. Section 6 summarises the contributions of this work and outlines future plans.

Related Work
Models for NLG can be divided into two groups: statistical and rule-based ones [13]. The latter employ linguistic expertise and work in three different phases: (i) document planning or content selection, (ii) microplanning and (iii) surface realisation [13,4]. During document planning, the information that will be communicated in the text is selected and organised (i.e. document structuring). The output of the document planner is used by the microplanner to decide how this information should be linguistically expressed in the generated text. Subsequently, the realiser generates the actual text by applying a specific template that satisfies the linguistic requirements set by the microplanner, and expresses the information as it is structured by the document planner. Each of the above stages is closely tied not only to the domain of the end-application but, in most cases, to the application itself.
Most of the previous work on NLG with Semantic Web data has focused on the verbalisation of domain ontologies by using rules. Examples include systems that generate text in domains with limited linguistic variability, such as clinical narratives [14], summaries of football matches [2], and descriptions of museum exhibits [3]. Further Semantic Web oriented NLG applications can be found in [4]. Our work naturally lies on the path opened by recent unsupervised [15] and distant-supervision-based [16] approaches for the extraction of RDF verbalisation templates using parallel data-to-text corpora. However, rather than predicting the template that would be the most appropriate to verbalise a set of input triples, our model jointly performs content selection and surface realisation, without the inclusion of any hand-coded rules or templates.
Previous work on neural network approaches shows their great potential at tackling a wide variety of NLP tasks, ranging from machine translation [5,6] to automatic response generation [17,18], and to computing vector representations of words in a continuous semantic space [19]. Our approach is inspired by the general encoder-decoder framework [5,6] with multi-gated Recurrent Neural Network (RNN) variants, such as the Gated Recurrent Unit (GRU) [5] and the Long Short-Term Memory (LSTM) cell [20]. Adaptations of this framework have demonstrated state-of-the-art performance in many generative tasks, such as machine translation [5,6,21], and conversation modelling and response generation [22,17].
Implementations based on the encoder-decoder framework work by mapping sequences of source tokens to sequences of target tokens. We adapt the sequence-to-sequence model to the requirements of Semantic Web data. Since sets of triples are unordered and not sequentially correlated, in the next section we propose a model that consists of a feed-forward neural network that encodes each input triple into a vector of fixed dimensionality in a continuous semantic space. Within this space, triples that have similar semantic meaning will have similar positions. We couple this novel encoder with an RNN-based decoder that generates the textual summary one token (i.e. a word, an entity, or a surface form of an entity) at a time.
Our task is most similar to recent work by Lebret et al. and Chisholm et al., who both employ adaptations of the encoder-decoder framework in order to generate the first sentence of a Wikipedia biography [10,23]. Lebret et al. propose a system that generates a summary given an input in the form of a Wikipedia infobox; the model proposed by Chisholm et al. generates a biography given a sequence of slot-value pairs extracted from Wikidata. The input representation of both these approaches is essentially limited to expressing single-subject relationships. In our case, the triple set that is aligned with each Wikipedia summary comprises more than just the DBpedia or Wikidata triples of the corresponding Wikipedia article. As we explain in more detail in Section 4.2, this triple set also includes triples whose subject is an entity that is related, in the respective knowledge base, to the main discussed entity of the Wikipedia biography, and whose object is the main subject of the Wikipedia summary. Furthermore, we believe that constraining the generative process to only the first sentence significantly simplifies the task in terms of the amount of information (i.e. in our case, the number of triples) that is lexicalised in the summary. Consequently, we choose to train on longer snippets of text to generate more elaborate summaries.

Our Model
An idealised example of our NLG task is presented in Table 1; our system takes as input a set of triples about the entity Walt Disney (i.e. the entity Walt Disney is either the subject or object of the triples in the set), and generates a sequence of words in order to summarise them in the form of comprehensible text. Given a set of E triples, F = {f_1, f_2, ..., f_E}, our goal is to learn a model that is able to generate a sequence of T tokens, Y = y_1, y_2, ..., y_T. We regard Y as a representation in natural language of the input set of triples, and we build a model that computes the probability of generating y_1, y_2, ..., y_T given the initial set of triples f_1, f_2, ..., f_E:

p(Y | F) = ∏_{t=1}^{T} p(y_t | y_1, ..., y_{t-1}, F) .    (1)

Figure 1: The triple encoder computes a vector representation for each of the two input triples, h_{f_1} and h_{f_2}. Subsequently, the decoder uses the concatenation of the two vectors, [h_{f_1}; h_{f_2}], to initialise the decoding process that generates the summary word by word. Each textual summary starts and ends with the respective start-of-sequence <start> and end-of-sequence <end> tokens.
Our model consists of a feed-forward architecture that encodes each triple from the input set into a vector of fixed dimensionality in a continuous semantic space. This is coupled to an RNN-based decoder that generates the textual summary one token (i.e. a word, an entity, or a surface form of an entity) at a time. Note that since bias terms can be included in each weight-matrix multiplication, they are not explicitly shown in the equations that describe the models of this section. The architecture of our generative model is shown in Figure 1.

Triple Encoder
Let F = {f_1, f_2, ..., f_E : f_i = (s_i, p_i, o_i)} be the set of triples f_1, ..., f_E, where s_i, p_i and o_i are the one-hot3 vector representations of the respective subject, predicate and object of the i-th statement. The vector representation h_{f_i} of the i-th triple is computed by forward propagating the triple encoder as follows:

h_{f_i} = ReLU(W_{h→h} [W_{x→h} s_i; W_{x→h} p_i; W_{x→h} o_i]) ,

where ReLU is the rectifier (i.e. non-linear activation function), [...; ...] represents vector concatenation, W_{x→h} : R^{|N|} → R^m is a trainable weight matrix that represents a biased linear mapping, where |N| is the cardinality of all the potential one-hot input vectors (i.e. the size of the dictionary of all the available predicates and entities of the triples), and W_{h→h} : R^{3m} → R^m is an unbiased linear mapping.
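The encoder computation above can be sketched numerically. This is a minimal illustration with toy dimensions; the random weights and sizes are placeholders, not the trained model's parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 10, 4                               # toy vocabulary size and embedding dim
W_xh = rng.normal(size=(m, N)) * 0.1       # W_{x->h}: R^|N| -> R^m, embeds s, p, o
W_hh = rng.normal(size=(m, 3 * m)) * 0.1   # W_{h->h}: R^{3m} -> R^m

def one_hot(i, n=N):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def encode_triple(s, p, o):
    """h_f = ReLU(W_hh [W_xh s; W_xh p; W_xh o])"""
    concat = np.concatenate([W_xh @ s, W_xh @ p, W_xh @ o])
    return np.maximum(0.0, W_hh @ concat)  # rectifier

# Encode one toy triple (subject id 1, predicate id 2, object id 3).
h_f = encode_triple(one_hot(1), one_hot(2), one_hot(3))
```

Each triple in F is encoded independently this way, yielding one m-dimensional vector per triple.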

Decoder
After the vector representation h_{f_i} of each triple f_i is obtained, we start the decoding process, during which the corresponding textual summary is generated. At each timestep t, the decoder makes a prediction about the next token to be appended to the summary by taking into consideration both the tokens that have already been generated and the contextual knowledge from the triples that were initially provided to the system as input. We experiment with two commonly used RNN variants, (i) the LSTM cell and (ii) the GRU, in order to explore which one works best for the decoding needs of our architecture.
We initialise the decoder with a fixed-length vector that we obtain after encoding all the information from the vector representations of the triples. Our approach is inspired by the general sequence-to-sequence framework, within which an RNN-based encoder encapsulates the information that exists in a sequence, and an RNN-based decoder generates a new sequence from this encapsulation [5,6]. However, since the triples in our problem are not sequentially correlated, we propose a concatenation-based formulation that enables us to capture the information across all the triples that are given as input to our system in one single vector. More specifically, given a set of triples' vector representations, h_{f_1}, ..., h_{f_E}, we compute:

h_F = W_{h_F→h^1_0} [h_{f_1}; ...; h_{f_E}] ,

where W_{h_F→h^1_0} : R^{Em} → R^m is a biased linear mapping. Subsequently, the hidden units of the LSTM- or GRU-based decoder (discussed below) at layer depth l = 1 are initialised with h^1_0 = h_F. Let h^l_t ∈ R^m be the aggregated output of a hidden unit at timestep t = 1, ..., T and layer depth l = 1, ..., L. The vectors at zero layer depth, h^0_t = W_{x→h} x_t, represent the words or entities that are given to the network as input. The parameter matrix W_{x→h} has dimensions [|X|, m], where |X| is the cardinality of all the potential one-hot input vectors (i.e. the size of the dictionary of all the available words and entities of the textual summaries). All subsequent matrices have dimension [m, m] unless stated otherwise.
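The concatenation-based initialisation can be sketched as follows; the weights are random placeholders with toy dimensions, standing in for the trained mapping:

```python
import numpy as np

rng = np.random.default_rng(1)
E, m = 3, 4                                 # toy number of triples and hidden dim
W_init = rng.normal(size=(m, E * m)) * 0.1  # biased linear map R^{Em} -> R^m

# Toy per-triple encodings h_{f_1}, ..., h_{f_E} (would come from the encoder).
h_triples = [rng.normal(size=m) for _ in range(E)]

# h_F = W [h_{f_1}; ...; h_{f_E}] initialises the decoder's hidden state h^1_0.
h_F = W_init @ np.concatenate(h_triples)
```

Because the triples are concatenated rather than fed sequentially, no ordering over the input set is imposed.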

Long Short-Term Memory (LSTM).
We adopt the architecture from [24]:

(in^l_t; f^l_t; out^l_t; g^l_t) = (sigm; sigm; sigm; tanh)(W^l [h^{l-1}_t; h^l_{t-1}])
c^l_t = f^l_t ⊙ c^l_{t-1} + in^l_t ⊙ g^l_t
h^l_t = out^l_t ⊙ tanh(c^l_t)

where W^l : R^{2m} → R^{4m} is a biased linear mapping, ⊙ denotes element-wise multiplication, and in^l_t, f^l_t, out^l_t and c^l_t are the vectors at timestep t and layer depth l that correspond to the input gate, the forget gate, the output gate and the cell respectively.
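A single LSTM step can be sketched as below; the single weight matrix acting on the concatenated input and the gate ordering follow the formulation in [24], while the random weights are illustrative only:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: z = W [x; h_prev] + b, split into input/forget/output
    gates and the candidate cell input g; then update cell and hidden state."""
    z = W @ np.concatenate([x, h_prev]) + b        # R^{2m} -> R^{4m}
    m = h_prev.shape[0]
    sigm = lambda a: 1.0 / (1.0 + np.exp(-a))
    i = sigm(z[:m])            # input gate
    f = sigm(z[m:2 * m])       # forget gate
    o = sigm(z[2 * m:3 * m])   # output gate
    g = np.tanh(z[3 * m:])     # candidate cell input
    c = f * c_prev + i * g     # cell update
    h = o * np.tanh(c)         # hidden state
    return h, c

m = 4
rng = np.random.default_rng(2)
W = rng.normal(size=(4 * m, 2 * m)) * 0.1
b = np.zeros(4 * m)
h, c = lstm_step(rng.normal(size=m), np.zeros(m), np.zeros(m), W, b)
```

Stacked layers feed h^{l-1}_t of the layer below as x at depth l, as in the equations above.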

Gated Recurrent Unit (GRU).
The GRU is a less complex variant of the LSTM cell [5] with comparable performance [25]:

(r^l_t; u^l_t) = sigm(W^l [h^{l-1}_t; h^l_{t-1}])
h̃^l_t = tanh(W^l_c [h^{l-1}_t; r^l_t ⊙ h^l_{t-1}])
h^l_t = u^l_t ⊙ h^l_{t-1} + (1 − u^l_t) ⊙ h̃^l_t

where W^l : R^{2m} → R^{2m} is a biased linear mapping, W^l_c : R^{2m} → R^m computes the hidden candidate, and r^l_t, u^l_t and h̃^l_t are the vectors at timestep t and layer depth l that represent the values of the reset gate, the update gate and the hidden candidate respectively.

Model Training
The conditional probability distribution over each token of the summary at each timestep t is represented with the softmax function over all the entries in the textual summaries dictionary:

p(y_t | y_1, ..., y_{t-1}, F) = softmax(W_y h^L_t) ,

where W_y : R^m → R^{|X|} is a biased trainable weight matrix. Our model learns to make a prediction about the next token by using the cross-entropy4 criterion.
During training, given a set of triples, our model makes a prediction about the sequence of tokens of which the generated summary is comprised. The model computes how far the generated sequence of tokens is from the empirical, actual text by utilising the negative logarithmic probability of the generated summary given the set of triples:

cost = − log p(Y | F) = − Σ_{t=1}^{T} log p(y_t | y_1, ..., y_{t-1}, F) .

Consequently, our model tries to minimise the above cost function. This non-convex optimisation problem is solved using the RMSProp5 algorithm.
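The per-summary cost can be sketched as a sum of per-timestep negative log-probabilities; the logits below are toy values rather than model outputs:

```python
import numpy as np

def summary_nll(logits_seq, target_ids):
    """-log p(Y|F) = -sum_t log softmax(logits_t)[y_t]."""
    nll = 0.0
    for logits, y in zip(logits_seq, target_ids):
        z = logits - logits.max()                 # shift for numerical stability
        log_probs = z - np.log(np.exp(z).sum())   # log-softmax
        nll -= log_probs[y]
    return nll

# Two toy timesteps over a 3-token vocabulary; targets are tokens 0 and 1.
logits_seq = [np.array([2.0, 0.0, 0.0]), np.array([0.0, 3.0, 0.0])]
loss = summary_nll(logits_seq, [0, 1])
```

Minimising this quantity with RMSProp corresponds to the training criterion described above.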

Generating Summaries
During testing, our goal is to find:

y* = argmax_Y p(Y | F) ,

where y* is the optimal summary computed by the model. Recall that at each timestep, our model predicts a probability distribution over the token that is most likely to come next. In theory, Viterbi decoding could approximate an optimal summary. However, in practice, the size of the target vocabulary |X| renders such an approach intractable. A different approach is to approximate the best summary by appending the token with the highest probability at each timestep of the generation process. Even though such greedy decoders have proven to be very fast when employed on machine translation problems, they tend to produce low-quality approximations [27].
A compromise between a strictly greedy decoding algorithm and Viterbi is to adopt a beam-search decoder [6], which provides us with the B most probable summaries or hypotheses given a set of triples F as input. The decoder maintains only a small number B of hypotheses (i.e. partially completed summaries), which it extends at every timestep with every token in the target vocabulary |X|.
During testing, we provide our network with an unknown set of triples, and we initialise the decoder with a special start-of-summary <start> token. The B tokens with the highest probability are used as separate inputs to the decoder at the second timestep. This leads to B|X| partial hypotheses, from which we retain only the B best. After all the second words of our hypotheses are provided as input to the decoder, we end up with B|X| partial three-word hypotheses, from which again we keep only the B with the highest probability. When the end-of-summary <end> token is predicted for a hypothesis, that hypothesis is appended to the list of complete summaries, and the process carries on with B = B − 1. An idealised example of a beam-search decoder is displayed in Figure 2.
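The beam-search procedure above can be sketched as follows. Here `step_fn` stands in for the trained decoder, and the toy scoring model is purely illustrative; finished hypotheses shrink the effective beam, as described in the text:

```python
import numpy as np

def beam_search(step_fn, start, end, vocab_size, B=3, max_len=10):
    """step_fn(prefix) -> log-probability vector over the vocabulary.
    Keeps the B best partial hypotheses; completed ones reduce the beam."""
    beams = [([start], 0.0)]
    done = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_p = step_fn(seq)
            for tok in range(vocab_size):
                candidates.append((seq + [tok], score + log_p[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates:
            if len(beams) + len(done) == B:
                break
            (done if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    return done or beams

# Toy model: always prefers token 1, then emits <end> (token 2) after 3 steps.
def toy_step(seq):
    p = np.full(3, -5.0)
    p[2 if len(seq) >= 3 else 1] = -0.1
    return p

best = beam_search(toy_step, start=0, end=2, vocab_size=3, B=2, max_len=6)
```

With B = 1 this reduces to the greedy decoder; larger B trades speed for better approximations of y*.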

Datasets
In order to address the training needs of our proposed model, we build two datasets of knowledge base triples aligned with texts. For the first dataset, we leverage the intrinsic alignment of DBpedia and Wikipedia in order to create a corpus of loosely aligned triples and textual summaries. In our second scenario, following [Elsahar et al. 2017], we align Wikipedia summaries with the community-curated triples of Wikidata.
Inspired by Lebret et al., we chose a corpus about biographies. Biographies represent one of the largest single domains in Wikipedia, providing us with a substantial amount of training data. We believe they offer the necessary diversity and linguistic variability that will allow us to explore effectively the generative ability of our systems. In addition, the fact that biographies tend to adopt a limited number of structural paradigms provides us with the opportunity to better understand the limitations of our approach.
Using PetScan6, we collected a list of 1,479,170 Wikipedia articles that have been detected by WikiProject Biography7. We then extracted the Wikipedia summaries and their corresponding DBpedia triples from the Mapping-based Objects8 and Literals8 DBpedia datasets, retaining only articles for which an infobox exists. For the Wikidata version of the dataset, we used the Wikidata truthy dumps9, and we kept only items for which Wikidata triples exist.
In addition to the above datasets, we also leverage two DBpedia datasets: (i) the Instance Types8 and (ii) the Genders8 datasets. The first provides us with special tokens for the entities that occur infrequently in our aligned datasets, and the second allows us to append a gender-related triple to the DBpedia triples that have already been allocated to an article. Since co-reference resolution is not performed as a data pre-processing stage, our hypothesis is that the additional knowledge from the inclusion of gender-related triples will increase the model's awareness of the gender of the main discussed entity of an article. Please note that the Genders dataset is used only for the DBpedia version of the aligned dataset, in which gender-related triples are extremely sparse.

Wikipedia Summaries
One of the main challenges associated with the alignment of triples from a structured knowledge base with texts is the identification of how the entities of the knowledge base are mentioned in the text. For instance, in the Wikipedia sentence "Barack Hussein Obama II is an American politician who served as the 44th President of the United States from 2009 to 2017." 10, we need to be able to identify that the surface forms "Barack Hussein Obama II" and "United States" refer to the respective DBpedia resources dbr:Barack_Obama and dbr:United_States. In order to sidestep this problem, we use DBpedia Spotlight [28], an automatic system for the annotation of DBpedia entities in text. Confidence and support are the two main variables that parameterise the annotation results returned by DBpedia Spotlight. The former is the lowest threshold of certainty that the system must have in order to return an annotation, and the latter is the lowest bound on the un-normalised total number of links to the returned entities.
We ran each of the extracted Wikipedia summaries through DBpedia Spotlight. Our goal was to find the parameter combination that provides the greatest number of relevant annotations, in order to (i) enhance the set of triples allocated to each Wikipedia page more effectively, and (ii) allow the model to learn directly how entities in the triples on the encoder side manifest themselves in the text on the decoder side. We empirically found that by setting the confidence and support parameters to 0.35 and −1 respectively, we increased the recall of the identified resources while maintaining precision at acceptable levels. We retained a list of all the possible surface forms to which each entity was mapped. Furthermore, given the nature of our problem, we excluded any Wikipedia summaries whose main discussed entity was not identified in the text. Each Wikipedia summary is tokenised and split into sentences using the Natural Language Toolkit (NLTK) [29].
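A hedged sketch of how such an annotation request can be parameterised is shown below. The endpoint URL reflects the public DBpedia Spotlight REST service and may differ from the deployment used by the authors; the request is built but not sent here:

```python
from urllib.parse import urlencode

# Public Spotlight annotate endpoint (assumed; self-hosted instances differ).
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def spotlight_request(text, confidence=0.35, support=-1):
    """Build an annotate request URL with the confidence/support settings
    used in the paper. Send with any HTTP client and an
    'Accept: application/json' header to get JSON annotations."""
    params = {"text": text, "confidence": confidence, "support": support}
    return SPOTLIGHT_URL + "?" + urlencode(params)

url = spotlight_request("Barack Hussein Obama II is an American politician.")
```

The returned JSON lists each annotated surface form together with the URI of the matched DBpedia resource.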

Knowledge Base Triples
Our text generation task consists of learning how entities, along with their relationships, are mentioned in the text. Given a set of triples, our approach learns to generate text one token at a time, without constraining the generative procedure to pre-defined templates that would include a given textual string as-is in the generated summary. Consequently, we excluded from our corpus any triples with a textual string as their object, except those that referred to numbers, dates or years. All instances of number-objects are replaced with the special token 0, except for year-objects, which are mapped to the special <year> token [10]. In both Wikidata and DBpedia, date-related objects are expressed as a string followed by its corresponding XML Schema URI (e.g. XMLSchema#dateTime or XMLSchema#date). In order to enable our model to process date-related triples and learn how their information is lexicalised in the text, we decompose them into two different triples. The first represents the month as it has been identified in the original triple, and the second represents the year. The object of the latter is subsequently mapped to the special <year> token. Table 2 presents an example of our date encoding approach.
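The date-decomposition step can be sketched as follows; the `_month`/`_year` predicate suffixes are assumptions for illustration (Table 2 gives the paper's exact encoding):

```python
from datetime import date

def decompose_date_triple(subj, pred, iso_date):
    """Split a date-typed triple into a month triple and a delexicalised
    year triple, following the date-encoding scheme described above."""
    d = date.fromisoformat(iso_date)
    month_triple = (subj, pred + "_month", d.strftime("%B"))  # e.g. "December"
    year_triple = (subj, pred + "_year", "<year>")            # year delexicalised
    return [month_triple, year_triple]

# Toy example: a birth-date triple for the main entity.
triples = decompose_date_triple("<item>", "dbo:birthDate", "1901-12-05")
```

This keeps the month lexicalisable while collapsing all years onto a single token, matching the number-object treatment above.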
For each entity identified in a Wikipedia summary using DBpedia Spotlight, we extracted its corresponding triples from the Mapping-based Objects dataset in DBpedia's case, and from the Wikidata truthy dump in Wikidata's case. We assume that the subjects or objects of a set of triples are consistent with the main subject of the corresponding Wikipedia summary. Consequently, from this additional set of triples we only retain those whose object matches the main discussed entity in each summary, and we append them to the initial set. This results in 450 unique predicates and 609k unique entities in DBpedia's case, and in 378 unique predicates and 378k unique entities in Wikidata's case.

Aligned Dataset
We built two aligned datasets that consist of: (i) 256,850 instances of Wikipedia summaries aligned with 2.74M DBpedia triples, and (ii) 358,908 instances of Wikipedia summaries aligned with a total of 4.34M Wikidata triples respectively. The size difference between our datasets is explained in a twofold manner. Firstly, there are Wikipedia biographies without an infobox (and, thus, without any available triples in the Mapping-based Objects and Literals DBpedia datasets). Secondly, even if they do have an infobox, the retrieved triples that are made available in the DBpedia dumps might not meet the requirements of our task (i.e. Section 4.2). For example, in case the objects of all the triples that are allocated to a Wikipedia biography are strings other than dates or numbers, then this summary would be excluded from the respective aligned dataset.
We describe next all the pre-processing steps that we followed in order to make our aligned datasets fit for training our neural network architectures.

Modelling the Generated Summaries
We retained only the first two sentences of each summary in order to reduce the computational cost of our task; summaries that consist of only one sentence were included unaltered. Since it would be impossible to learn a unique vector representation for the main discussed entity of each Wikipedia summary, due to the lack of occurrences of the majority of those entities in the datasets, we replaced them with the special <item> token. We used a fixed vocabulary of the 30,000 and 32,000 most frequent tokens (i.e. either words or entities) of the summaries that are aligned with the respective DBpedia and Wikidata triples. Similarly to the input triples (i.e. Section 4.2), all occurrences of numbers in the text are replaced with the special token 0, except for years, which are mapped to the special <year> token [10]. Every out-of-vocabulary word is represented by the special <rare> token.
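The number and year delexicalisation can be sketched with a simple heuristic; the year-detection regex is an assumption for illustration, not the authors' exact rule:

```python
import re

def delexicalise_numbers(tokens):
    """Map year-like tokens to <year> and other numbers to the token 0,
    mirroring the preprocessing described above (heuristic sketch)."""
    out = []
    for tok in tokens:
        if re.fullmatch(r"1\d{3}|20\d{2}", tok):    # crude 4-digit year test
            out.append("<year>")
        elif re.fullmatch(r"\d+(\.\d+)?", tok):     # any other number
            out.append("0")
        else:
            out.append(tok)
    return out

result = delexicalise_numbers(["born", "in", "1901", ",", "had", "4", "children"])
```

The same mapping is applied on the triples side, so years in text and year-objects in triples share one token.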
Using a single special token for all the rare entities that have not been included in the fixed target vocabulary would substantially limit the model, causing unnecessary repetition of this particular token in the generated summaries. Inspired by the multi-placeholder model [11], we first attempt to match a rare entity that has been annotated in the text against the subjects or the objects of the allocated triples. In case it exists in the triples, it is replaced by a placeholder token, which consists of the predicate of the triple, a descriptor of the component of the triple that was matched (i.e. obj or subj), and the instance type of the entity. The instance type of an entity is obtained from the Instance Types dataset. For example, in the case that the subject of the triple (dbr:The_Adventures_of_Roderick_Random dbo:author dbr:Tobias_Smollett) is annotated as a rare entity in the corresponding summary, it is replaced with the special token dbo:author_subj dbo:Book. In the case that a rare entity is matched to the object of the triple (dbr:Kevin_Carr dbo:birthPlace dbr:Morpeth,_Northumberland), it is replaced with the token dbo:birthPlace_obj dbo:Settlement. We refer to those placeholders as property-type placeholders. In case the entity does not have a type in the Instance Types dataset, the instance type part of the placeholder is filled by the <unk> token (e.g. dbo:birthPlace_obj <unk>). If the rare entity is not matched to any subject or object of the set of corresponding triples, then it is replaced by the special token of its instance type. In case the rare entity does not exist in the Instance Types dataset, it is replaced by the <unk> token.
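The construction of property-type placeholders can be sketched as follows; the exact token spelling (underscore-joined predicate and descriptor) is assumed from the examples in the text:

```python
def placeholder_for(entity, triples, instance_types):
    """Build a placeholder for a rare entity: first try to match it as the
    subject or object of an aligned triple (property-type placeholder),
    else fall back to its instance type, else to <unk>."""
    for s, p, o in triples:
        if s == entity:
            return f"{p}_subj {instance_types.get(entity, '<unk>')}"
        if o == entity:
            return f"{p}_obj {instance_types.get(entity, '<unk>')}"
    return instance_types.get(entity, "<unk>")

triples = [("dbr:The_Adventures_of_Roderick_Random", "dbo:author",
            "dbr:Tobias_Smollett")]
types = {"dbr:The_Adventures_of_Roderick_Random": "dbo:Book"}
tok = placeholder_for("dbr:The_Adventures_of_Roderick_Random", triples, types)
```

At generation time, a predicted placeholder can be resolved back to the matching entity in the input triples.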
Note that each summary is augmented with the respective start-of-summary <start> and end-of-summary <end> tokens.

Modelling the Input Triples
Similarly to the Wikipedia summaries, we represent the occurrence of the main discussed entity of the corresponding summary, as either subject or object of a triple, with the special <item> token. A shared, fixed dictionary was used for all subjects, predicates and objects. First, we included all the predicates and entities that occur at least 20 times. Triples with rare predicates are discarded. Every out-of-vocabulary entity is replaced by the special token of its instance type, which is retrieved from the Instance Types dataset. For example, the rare entity dbr:Mamma_Mia! is replaced by the dbo:Musical token. In case an infrequent entity is not found in the Instance Types dataset, it is replaced with the special <unk> token. We appended to the source vocabulary only the instance type tokens that occur at least 20 times, and, finally, we used the <resource> token for the rare entities whose instance types are also infrequent.
In order to increase the homogeneity of the dataset in terms of the number of triples that are aligned with each Wikipedia summary, we limit the number of allocated triples per summary, E, to a fixed range. In case a biography is aligned with fewer triples, it is excluded from the respective dataset. If a summary is aligned with more triples, we first attempt to exclude potential duplicates (e.g. Fiorenzo Magni dbp:proyears 1945 and Fiorenzo Magni dbp:proyears 1944 would result in the same delexicalised triple: <item> dbp:proyears <year>).
If their number still exceeds the limit, we retain only the first triples until the threshold is reached. Table 3 shows statistics on the initial and the training-ready version of each corpus. An example of the structure of the datasets is displayed in Table 4. More details about the two different types of summaries (i.e. Summary with URIs and Surface Form Tuples) with which we trained our models are provided in Section 5.
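The deduplication-and-truncation step can be sketched as follows; `max_triples` stands in for the per-summary threshold E, whose exact value is not fixed in this sketch:

```python
def cap_triples(delex_triples, max_triples):
    """Deduplicate delexicalised triples (order-preserving), then truncate
    to the per-summary limit, keeping the first ones as described above."""
    seen, unique = set(), []
    for t in delex_triples:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique[:max_triples]

# Two proyears triples collapse into one after delexicalisation.
triples = [("<item>", "dbp:proyears", "<year>"),
           ("<item>", "dbp:proyears", "<year>"),   # duplicate after delexicalisation
           ("<item>", "dbo:birthPlace", "dbr:Vaiano")]
kept = cap_triples(triples, max_triples=2)
```

Summaries whose triple count falls below the lower bound after this step are dropped from the dataset.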

Experiments
We use the above datasets of Wikipedia summaries aligned with DBpedia and Wikidata triples in order to train and evaluate the performance of our neural network models. Both datasets are split into training, validation and test sets with respective portions of 85%, 10% and 5%. The implemented architectures have been developed using the Torch11 software package. Any cleaning or restructuring procedure carried out on the datasets has been conducted with Python scripts. The code will be made publicly available upon notification of acceptance.
Our proposed neural network architectures learn to generate a textual summary as a sequence of words and entities. In order to infer the verbalisation of the predicted entities in a generated summary, we experiment with two different approaches, which are described in detail below.

Generating Words Along With URIs
In this setup, all the entities that have been annotated in the text with DBpedia Spotlight are replaced with their respective URIs. The summaries' vocabulary is comprised of words and the entities' URIs. The model thus learns to generate words along with the URIs of entities. In order to improve the generated text further, as a post-processing step we replace: (i) the <item> token with its corresponding surface form, and (ii) tokens of DBpedia or Wikidata entities in the text with their most frequently matched surface form, as these are recorded during our data pre-processing (i.e. Section 4.1). [11] Torch is a scientific computing package for Lua; it is based on the LuaJIT package.
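The post-processing step can be sketched as a simple token replacement (a minimal sketch; the function name and the example surface forms are illustrative):

```python
def postprocess(tokens, item_surface_form, most_frequent_surface_form):
    """Replace the <item> token and any entity-URI tokens in a generated
    summary with surface forms, as described above."""
    out = []
    for tok in tokens:
        if tok == "<item>":
            out.append(item_surface_form)          # main entity's own name
        elif tok.startswith(("dbr:", "wikidata:")):
            # Fall back to the URI itself if no surface form was recorded.
            out.append(most_frequent_surface_form.get(tok, tok))
        else:
            out.append(tok)
    return " ".join(out)
```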

Generating Words Along With Surface Form Tuples
In order to eliminate the post-processing step of replacing the entities' URIs with their most frequently met surface forms, we propose a setup that enables our system to make a prediction about the best verbalisation of a predicted entity in the text. Each entity that has been identified in the text of the Wikipedia summaries using DBpedia Spotlight is stored as a tuple of the annotated surface form and its URI. Let K = {k_1, k_2, ..., k_D} be the set of all the D entities that are annotated in the text. We define the r-th surface form tuple of the d-th entity k_d as ⟨k_d, g_r⟩, where g_r is the r-th surface form that is associated with the entity k_d. Similarly to Section 5.1, these tuples are stored as tokens in the target vocabulary. This setup enables the models to verbalise each entity in more than one way by adapting the surface forms to the context of both the generated tokens and the input triples.
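A surface-form tuple can be represented as a (URI, surface form) pair, so that a single entity contributes several distinct target-vocabulary tokens (a minimal sketch; the function names are illustrative):

```python
def surface_form_token(entity_uri, surface_form):
    """One target-vocabulary token: an entity paired with one of its
    annotated surface forms."""
    return (entity_uri, surface_form)

def build_tuple_vocab(annotations):
    """annotations: (entity URI, surface form) pairs found by the entity
    linker in the summaries. Each distinct pair becomes one token, so the
    decoder can choose among an entity's verbalisations at generation time."""
    return {surface_form_token(e, g) for e, g in annotations}
```

For example, dbr:Actor annotated once as "actor" and once as "actress" yields two separate tokens, which is exactly what allows the decoder to pick the contextually appropriate verbalisation.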

Training Details
We train two different models. The first one is the triple encoder coupled with the GRU-based decoder, to which we refer as Triples2GRU; the other is the same triple encoder coupled with the LSTM-based decoder (Triples2LSTM). For each dataset, we train each model on our task of generating a summary once as a combination of words with URIs (w/ URIs, Section 5.1) and once as a mixture of words and surface form tuples (w/ Surf. Form Tuples, Section 5.2).
For the recurrent component of our networks, we use 1 layer of (i) 650 LSTM cells and (ii) 750 GRUs, resulting in 3.38M and 3.375M recurrent connections respectively. We found that increasing the number of layers does not improve the performance of our architectures, whereas the dimensionality of the hidden states plays a crucial role in achieving the best possible results. Table 5 summarises the hyper-parameters that have been used for the training of our models.
Table 4: Example of the alignment of our dataset. One Wikipedia summary is coupled with a set of triples from either DBpedia or Wikidata. Any reference to the main discussed entity of the summary (i.e. dbr:Papa_Roach or wikidata:Q254371 respectively) is replaced by the special <item> token both in the text and the corresponding triples. Every other entity is stored along with its instance type. Infrequent entities are replaced with the special token of their instance type both in the text and the triples (e.g. "triple platinum" is replaced with dbr:RIAA_certification). When a rare entity in the text is matched to an entity of the corresponding triples' set, it is replaced by a unique token, which consists of the predicate of the triple, a descriptor of the component of the triple that was matched, and the instance type of the entity (e.g. the reference to the music album "Infest (2000)" is replaced with the placeholder [dbo:artist sub dbo:Album]).

The feed-forward triples encoder consists of a sequence of fully-connected layers with the following [input, output] dimensionalities. At the topmost layer of the encoder, we have a fully-connected layer that maps the concatenated hidden representations of all the triples aligned with a summary to one single vector: h_F = W [h_1; h_2; ...; h_{E_max}], where E_max is the maximum number of triples per article. Sets with fewer than E_max triples are padded with zero vectors when necessary. We optimised our architectures using an alteration of stochastic gradient descent with an adaptive learning rate. We found that a fixed learning rate resulted in the explosion of the gradients that were propagated to the encoder side of our models. We believe that this behaviour is explained by the fact that our models learn to project data of dissimilar nature (i.e. structured data from the triples and unstructured text from the summaries) into a shared continuous semantic space. In case their parameters are not initialised properly, our neural architectures
propagate vectors of different orders of magnitude, leading to the exploding-gradients phenomenon. However, finding the appropriate values to initialise the models' parameters is certainly not trivial [30]. In order to sidestep this problem, we use Batch Normalisation before each non-linear activation function and after each fully-connected layer, both on the encoder and the decoder side, and we initialise all parameters from a random uniform distribution between −0.001 and 0.001 [30]. The networks were trained with mini-batch RMSProp with an initial learning rate value of 0.002. Each update is computed using a mini-batch of 85 dataset instances. An l2 regularisation term over the parameters is also included in the cost function. After the 3rd epoch, the learning rate was decayed by 0.8 every half epoch.
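The learning-rate schedule described above can be sketched as follows (a minimal sketch; the function name and the handling of epoch boundaries are our own assumptions):

```python
def learning_rate(epoch_progress, base_lr=0.002, start_epoch=3, decay=0.8):
    """Base rate of 0.002 up to the 3rd epoch, then decayed by a factor
    of 0.8 for every completed half epoch, as described above.
    `epoch_progress` is a fractional epoch count, e.g. 3.5."""
    if epoch_progress <= start_epoch:
        return base_lr
    n_half_epochs = int((epoch_progress - start_epoch) / 0.5)
    return base_lr * decay ** n_half_epochs
```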
We trained all of our models on a single Titan X (Pascal). LSTM-based models complete an epoch of training: (i) in around 25 minutes when trained on the Wikidata dataset, and (ii) in 17 minutes when trained on the DBpedia one; the GRU-based architectures require (i) around 22 minutes when trained on the Wikidata dataset, and (ii) 15 minutes when trained on the DBpedia one.

Automatic Evaluation
We use perplexity, BLEU, and ROUGE on the validation and test sets. Perplexity indicates how well the model learns its training objective (i.e. Section 3.4), whereas BLEU and ROUGE measure how close the generated text is to the actual Wikipedia summary. Essentially, BLEU and ROUGE are complementary to each other. The former computes a modified version of n-gram precision, whereas the latter computes n-gram recall, of the automatically generated sentences with respect to the empirical Wikipedia summaries.
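The modified n-gram precision underlying BLEU can be sketched as follows (a minimal sketch; full BLEU combines several n-gram orders with a brevity penalty, and ROUGE computes the recall-oriented counterpart):

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision: each candidate n-gram counts only as
    many times as it appears in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0
```

The clipping is what distinguishes it from plain precision: a candidate that degenerately repeats "the" three times against a reference containing "the" once scores 1/3, not 1.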
We adapt the code from the evaluation package that was released by Peter Anderson, which was originally implemented to score textual descriptions of images. Perplexity, BLEU 1, BLEU 2, BLEU 3, BLEU 4, and ROUGE L (i.e. a variant of the original ROUGE that is computed on the longest common subsequence) results are reported in Table 6.
To demonstrate the effectiveness of our system, we compare it against two baselines. First, we compute expected lower bounds for BLEU scores by using a random Wikipedia-summary-generation baseline. We regard the latter as a particularly strong baseline, due to the fact that Wikipedia biographies tend to follow a limited number of textual patterns. For each triple set in the validation and test sets, the random system generates a response by randomly selecting a Wikipedia summary from our training set. Secondly, we use the KenLM toolkit [33] in order to build a 5-gram Kneser-Ney (KN) language model. During testing, similarly to the case of our neural network approach, for each triple set in the validation and test sets, we use beam search with a beam of size 10 in order to generate the 10 most probable summaries. We equip both of our baselines with surface form tuples, and with the <item> and property-type placeholders. After a summary is selected, its <item> placeholder along with any potential property-type placeholders are replaced based on the original triples. In case a property-type placeholder is not matched to the content of the triples, it is replaced by its corresponding instance type token (i.e. Section 4.3.1). The results are illustrated in Table 6.
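The random-summary baseline, including the placeholder-replacement step, can be sketched as follows (a minimal sketch; the function name and the exact placeholder token format are illustrative):

```python
import random

def random_baseline(train_summaries, item_surface, placeholder_fillers, rng=None):
    """Pick a random training summary, then fill its <item> token and any
    property-type placeholders from the input triples, as described above."""
    rng = rng or random.Random(0)       # fixed seed for reproducibility
    tokens = rng.choice(train_summaries)
    out = []
    for tok in tokens:
        if tok == "<item>":
            out.append(item_surface)
        elif tok in placeholder_fillers:
            out.append(placeholder_fillers[tok])
        else:
            out.append(tok)
    return " ".join(out)
```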
Our scores are lower than those usually reported for machine translation tasks. However, they should be indicative of how well our model can generate a Wikipedia summary given the set of its corresponding triples. Furthermore, it should be noted that our task consists of learning to generate text from a corpus of triples loosely associated with text, which is not the case in machine translation, where there is a tight alignment between the source and the generated language.
In addition to the above experiments, we group Wikipedia summaries that are allocated to the same number of input triples and compute a BLEU score per group. Figure 3 displays the performance of our models on the BLEU 4 metric across different numbers of input triples. The low performance of the models when they are initialised with a low number of triples is explained by the fact that the systems lack the information required to form a two-sentence summary.
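The per-group scoring can be sketched as follows (a minimal sketch; `bleu_fn` stands in for a full corpus-level BLEU implementation):

```python
from collections import defaultdict

def bleu_by_triple_count(examples, bleu_fn):
    """Group test examples by their number of input triples and score
    each group separately, as in the analysis above.
    `examples` holds (triples, candidate, reference) items;
    `bleu_fn` scores a list of (candidate, reference) pairs."""
    groups = defaultdict(list)
    for triples, candidate, reference in examples:
        groups[len(triples)].append((candidate, reference))
    return {n: bleu_fn(pairs) for n, pairs in sorted(groups.items())}
```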
Table 6: Automatic evaluation with the perplexity (i.e. lower is better) and the BLEU and ROUGE L metrics (i.e. higher is better) on the validation and the test set. The average performance of the baseline along with its standard deviation is reported after sampling 10 times.

Human Evaluation
Given the exploratory nature of our task, human evaluation is necessary in order to objectively assess the performance of our approach. The human evaluation was conducted with seven researchers, all of whom are experts in the field of the Semantic Web. For each corpus, we compiled a list of 15 randomly selected sets of triples along with the textual summaries that had been generated by each one of our proposed models (i.e. (i) GRU with URIs and surface forms, and (ii) LSTM with URIs and surface forms). The sets of triples are sampled from the test set. We conducted two separate experiments, one for each corpus.
Our experiments showed that in our dataset, sets with fewer triples usually lack enough information for our model to generate a summary (i.e. Section 5.4). Hence, we included only sets that consist of at least 6 triples. The specific model to which each generated summary corresponds (i.e. LSTM or GRU, with URIs or surface form tuples) was anonymised. The evaluators were asked to rate each generated summary against four different criteria: (i) fluency, (ii) number of contradicting facts (i.e. information that exists in the sentence but conflicts with a number of triples from the input set), (iii) number of summarised triples (i.e. triples whose information is mentioned either implicitly or explicitly in the text), and (iv) number of triples to which potential additional information in the text can be attributed. We divide each score for the three latter criteria by the total number of triples in the respective set. Fluency was marked on a scale from 1 to 6, with 1 indicating an incomprehensible summary, 2 a barely understandable summary with significant grammatical errors, 3 an understandable summary with grammatical flaws, 4 a comprehensible summary with minor grammatical errors, 5 a comprehensible and grammatically correct summary that reads a bit artificially, and 6 a coherent and grammatically correct summary [34]. The results of the human evaluation are aligned with the results of our automatic evaluation with both the BLEU and ROUGE metrics (i.e. Section 5.4). The GRU-based architectures outperform the LSTM-based ones in all criteria. Furthermore, they score consistently better in terms of the inclusion of additional or contradicting information. Since they are more reluctant to introduce out-of-context information in the text, their generated textual content is better aligned with the input triples.
In general, the evaluators scored all of our models with high fluency ratings, thus emphasising the ability of our approach to generate grammatically and syntactically correct text. We note, however, that verbalising the occurrence of entities in the text with the mechanism of surface form tuples makes all the investigated setups more fluent.

Discussion
Two examples of textual summaries that are generated by our models are shown in Table 7. We selected two representative sets of triples from the test set. The examples illustrate our approaches' capability of generating sentences that couple information from several triples of each set. In the first example, all the models are able to capture the main entity's gender from the input triple set. However, only the models equipped with surface form tuples are able to correctly verbalise the entity dbr:Actor as "actress" in the text. This is due to the fact that, in the biographies dataset, the most frequent surface form with which the entity dbr:Actor has been associated is "actor". Subsequently, "actor" is used as the replacement of all the occurrences of the dbr:Actor entity in the summaries that are generated by our w/ URIs models.
The learned embeddings on the decoder side capture information that is coupled both with the embeddings on the encoder side (e.g. the embeddings of the pronouns "She" and "her" are coupled implicitly with the existence of the triple <item> dbo:occupation dbr:Actress) and with each token's own probability of occurring in the context of the sequentially generated text (e.g. a word appears with its first letter capitalised when it follows a full stop). Consequently, items that have a similar semantic meaning find themselves close together in the continuous semantic space. Table 9 shows the nearest neighbours of some of the most frequently occurring entities in our datasets, as learned by our models. This illustrates our models' capability to successfully infer semantic relationships among entities.
The main drawback of training our models on a dataset of triples loosely associated with text is that the information that exists in the triples does not necessarily appear in the corresponding text, and vice versa. As a result, the models are not penalised when they generate textual content that does not exist in the set of input triples.

Conclusion
To the best of our knowledge, this work constitutes the first attempt to use neural networks for Natural Language Generation on top of Semantic Web triples. We propose an end-to-end trainable system that is able to generate a textual summary given a set of triples as input. The generated summary discusses various aspects of the information encoded within the input triple set.
Our approach does not require any hand-engineered templates and can be applied to a great variety of domains. We propose a method of building a loosely aligned dataset of DBpedia and Wikidata triples with Wikipedia summaries in order to satisfy the training requirements of our system. Using these datasets, we have demonstrated that our technique is capable of scaling to domains with vocabularies of over 400k words. We address the problem of learning high-quality vector representations for rare entities by adopting a multi-placeholder approach that enables the lexicalisation of rare entities in the text. Our models learn to emit these placeholder tokens, which are subsequently replaced based on the content of the input triples. Our evaluation demonstrated our approach's ability to predict the Wikipedia summary that corresponds to a set of unknown triples, showing substantial improvement over our baselines. Furthermore, our choice to introduce a statistical approach for inferring the verbalisation of the entities in the text further enhances the fluency of the generated summaries, as reported by our human evaluators, compared to a purely deterministic replacement of the generated entities' URIs.

Table 8: The average rating of our models against the human evaluation criteria. For fluency and summarised triples, the higher the score the better; for contradicting triples and additional information, the lower the score the better. The results are reported in the "mean ± standard deviation" format.

Table 9 (excerpt): nearest neighbours learned by the encoder.
[…]: wikidata:Q17427 (Communist Party of China), wikidata:Q865 (Taiwan), wikidata:Q7850 (Chinese language), wikidata:Q8686 (Shanghai), and wikidata:Q1348 (Kolkata)
wikidata:Q20 (Norway): wikidata:Q35 (Denmark), wikidata:Q486156 (University of Oslo), wikidata:Q9043 (Norwegian language), wikidata:Q11739 (Lahore), and wikidata:Q585 (Oslo)
wikidata:Q15981151 (jazz musician): wikidata:Q12800682 (saxophonist), wikidata:Q248970 (Berklee College of Music), wikidata:Q806349 (bandleader), wikidata:Q12804204 (percussionist), and wikidata:Q8341 (jazz)
wikidata:Q158852 (conductor): wikidata:Q1415090 (film score composer), wikidata:Q9734 (symphony), wikidata:Q3455803 (director), wikidata:Q1198887 (music director), and wikidata:Q2994538 (Conservatoire national supérieur de musique et de danse)

Figure 2 :
Figure 2: An idealised example of a beam-search decoder with a beam B of size 2 and a target vocabulary size |X| equal to 9. The scores on the right-hand side of the words in the vocabulary are the probabilities of the summary when it is extended by that particular word.

Figure 3 :
Figure 3: Performance of our models on the BLEU 4 metric across different numbers of input triples on DBpedia (a) and Wikidata (b).

Table 1 :
An idealised example of our NLG task. Our system takes as input a set of triples about Walt Disney, whose subject or object is related to the entity of Walt Disney, and it generates a textual summary.

Table 2 :
An example of how a triple whose object is identified as a date is encoded into two different triples.The first one represents the month that has been identified in the original triple, and the second the year.

Table 3 :
Statistics regarding the initial and the training version of our two corpora based on DBpedia and Wikidata triples.

Table 9 :
Nearest neighbours of the vector representations of some of the most frequently occurring entities as these are learned by the encoder.