Table to text generation with accurate content copying

Generating fluent, coherent, and informative text from structured data is called table-to-text generation. Copying words from the table is a common way to address the “out-of-vocabulary” problem, but accurate copying is difficult to achieve. To overcome this problem, we propose an auto-regressive framework based on the transformer that combines a copying mechanism and language modeling to generate target texts. First, to help the model learn the semantic relevance between table and text, we apply a word transformation method that incorporates field and position information into the target text so that the model learns where to copy from. We then propose two auxiliary learning objectives, namely a table-text constraint loss and a copy loss. The table-text constraint loss is used to model table inputs effectively, whereas the copy loss is used to copy word fragments from the table precisely. Furthermore, we improve the text search strategy to reduce the probability of generating incoherent and repetitive sentences. Experiments on two datasets show that the model outperforms the baseline models. On WIKIBIO, BLEU improves from 45.47 to 46.87 and ROUGE from 41.54 to 42.28. On ROTOWIRE, the CO metric increases by 4.29% and BLEU is 1.93 points higher.


Related work
Table-to-text generation has attracted widespread attention, aiming to help humans better understand tabular data. We classify table-to-text generation methods into two groups: pipeline methods and end-to-end methods. Early data-to-text generation methods follow the pipeline pattern, which divides generation into content selection and surface realization. The pipeline pattern relies heavily on rule-based and template-based approaches, which typically involve selecting the correct rule set or retrieving the appropriate template for the generation task at hand 13,14.
In recent years, due to the emergence of large-scale parallel datasets such as WIKIBIO, end-to-end neural network methods have become a research focus. Ref. 15 proposed the neural checklist model to address the problem of repeated information generation from structured data in recurrent neural network (RNN) models. The model is applied to recipe generation, where dish names and ingredient lists are the inputs and the machine outputs the corresponding recipes. Text generation based on structured data suffers from data sparsity: many attributes and values in structured data occur rarely, which makes the model difficult to learn. Ref. 4 introduced the copy mechanism into the neural language model to cope with sparse data. Based on a conditional neural language model, the structured data are parsed locally and globally, with a focus on the attribute information in the data. Ref. 16 extended the classical sequence-to-sequence model with multiple decoders and a latent variable that specifies which decoder generates the final text. Learning is enhanced by setting up multiple submodels that are each responsible only for specific data expressions. Ref. 17 explicitly modeled content selection and content planning in an end-to-end neural network architecture. The generation task is divided into two stages: content selection and content planning first determine which information should be mentioned and in what order, and the generated content plan is then taken as input to produce the text. Additionally, to increase the interpretability and controllability of the models, a number of recent models combine end-to-end approaches with traditional rule-based and template-based approaches. Ref. 18 used a hidden semi-Markov model (HSMM) to model text and parameterized all probability terms with a neural network. After the model is trained, the Viterbi algorithm is used to obtain templates for text generation.
Although the above algorithms have achieved promising results, RNN-based models fail to capture long-term dependencies. Ref. 19 used a transformer-based model for machine translation. Its sentence-level agreement module minimizes the difference between the source and target sentences, resulting in close distributions of the sentence-level vectors on the source and target sides. Ref. 20 presented a transformer-based data-to-text generation model that learns content selection and surface realization in an end-to-end manner. It improves the correctness of the output by modifying the input representation, adds an additional learning objective for content selection modeling, and achieves good results on game summaries. Ref. 21 proposed a few-shot table-to-text generation model that uses a powerful pre-trained model (GPT-2) and two auxiliary learning tasks, outperforming state-of-the-art baselines on three few-shot datasets. A template-based table transformation module converts the table into a sequence, and the two auxiliary tasks, table structure reconstruction and content matching, compensate for the pre-trained model's lack of table structure modeling and text fidelity. Ref. 22 proposed a general knowledge-based pre-training model (KGPT) to deal with various text generation tasks and achieved strong performance in few-shot and zero-shot settings. The model is first pre-trained on the constructed knowledge-based KGTEXT dataset and then fine-tuned on downstream tasks such as WikiBio 4, WebNLG 23 and E2ENLG 24. Ref. 25 proposed a new algorithm for faithful table-to-text generation. Two faithful generation methods are proposed: generation according to an augmented plan and selection of training examples based on faithfulness ranking. In addition, two new metrics are introduced to evaluate generation faithfulness. Ref. 26 proposed an end-to-end model to generate entity descriptions. It jointly learns text generation and content planning to deal with disordered input, and applies a content-plan-based bag-of-tokens attention mechanism to highlight salient attributes in an appropriate order.

Table-to-text generation
The task of table-to-text generation is to take a structured table $T = \{(f_1, v_1), (f_2, v_2), \ldots, (f_m, v_m)\}$ as input and output a natural language description that consists of a sequence of words $y = \{y_1, y_2, \ldots, y_n\}$. Each input sentence $T_i = \{f_i, v_i\}$ consists of a field $f_i$ and its corresponding sequence of word fragments $v_i = \{v_i^1, v_i^2, \ldots, v_i^l\}$. Here, $m$ is the number of fields and values, $n$ is the number of words in each description, and $l$ is the number of words in each value.
Figure 1 illustrates the overall framework of our model. Our model uses the encoder-decoder architecture. The encoder is composed of an input layer and N identical layers, each of which has two sub-layers. The decoder consists of an input layer, N identical layers, a linear layer and a softmax layer. "Nx" denotes a stack of N identical layers; in the experiments, we set N to 6. In addition to the two sub-layers of the encoder layer, each decoder layer has a third sub-layer. The final output of the decoder is the probability distribution of the word at the corresponding position. The model is designed around two aspects, table content copying and language modeling, to generate target texts. In the training process, we propose two auxiliary learning tasks, table-text constraint loss and copy loss, in addition to the traditional generation task.
Transformer model. We adopt the transformer model 27 as our base model. The transformer is based solely on a self-attention mechanism, thereby removing the recurrence and convolution operations completely. Each encoder layer has two sub-layers, the multi-head self-attention defined by Eqs. (1-3) and the feedforward network defined by Eq. (4). Our proposed transformer-based table-to-text generation model learns to estimate the conditional probability of a text sequence from a source table input, as shown in Eq. (5).
The input representation includes the positions of each token, counted from the beginning and the end of its field, as the positional embedding of the token, replacing the positional encoding in the transformer model. The field embedding $\hat{Z}_{enc}$ and the context embedding $\hat{C}_{enc}$ are then concatenated to obtain the embedding representation of the table, $X = \{\hat{Z}_{enc}; \hat{C}_{enc}\}$. We define $R_{enc}$ as the table representation of $X$ produced by the self-attention layers in the encoder, and $E_{dec}$ as the target text representation of $y$ obtained from the embedding layers in the decoder. Here, $\hat{R}_{enc} = \mathrm{Mean}(R_{enc})$ and $\hat{E}_{dec} = \mathrm{Mean}(E_{dec})$ are the mean value embeddings of the source and target sentences, respectively, which the table-text constraint loss brings closer together.
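As a rough illustration of the mean-pooled representations described above, the following NumPy sketch averages the encoder and decoder vectors and penalizes the distance between the two means. The squared L2 form and the names `mean_pool` and `constraint_loss` are assumptions made for illustration, in the spirit of the sentence-level agreement idea of 19; the exact definition of the table-text constraint loss may differ.

```python
import numpy as np

def mean_pool(x):
    """Average token vectors of shape (length, dim) into one (dim,) vector."""
    return np.asarray(x, dtype=float).mean(axis=0)

def constraint_loss(r_enc, e_dec):
    """Squared L2 distance between mean-pooled source and target vectors.

    The squared-distance form is an assumption, not the paper's stated formula.
    """
    diff = mean_pool(r_enc) - mean_pool(e_dec)
    return float(diff @ diff)

# Toy example: two table tokens and three target tokens, dimension 2.
r_enc = np.array([[1.0, 0.0], [3.0, 2.0]])
e_dec = np.array([[2.0, 1.0], [2.0, 1.0], [2.0, 1.0]])
print(constraint_loss(r_enc, e_dec))  # both means are [2, 1], so the loss is 0.0
```

Because only the two mean vectors enter the loss, the source and target sequences may have different lengths, and the constraint adds no parameters to the model.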
Pointer-generator network with copy loss. In this part, we use a word transformation module and a copy loss to guide the pointer-generator network 10 to copy the table content correctly.
Pointer-generator network. The pointer-generator network combines the seq2seq model with a pointer network and maintains a probability $p_{copy}$ to choose between copying from the input table and generating from a fixed vocabulary. The final word probability distribution is therefore a mixture, weighted by $p_{copy}$, of the copy distribution over the table words and the vocabulary distribution. Here, $W_h$, $W_s$, $W_x$ and $b$ are learnable parameters; $p_{vocab}$ denotes the probabilities of generating the next word; $h_t^* = \sum_i a_i^t h_i$, where $h_i$ are the hidden states of the encoder; and $y_t^{new}$, $s_t$, and $a_i^t$ are the input of the decoder, the hidden state, and the attention weights returned from the encoder-decoder attention module, respectively.
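The mixture of copying and generating can be sketched as follows. This is a minimal NumPy illustration that follows the usual pointer-generator formulation; it assumes that every table token already has an id in the (possibly extended) vocabulary and that $p_{copy}$ weights the copy distribution. The function name and argument layout are illustrative, not the authors' implementation.

```python
import numpy as np

def final_distribution(p_vocab, attention, source_ids, p_copy):
    """Mix the generation and copy distributions.

    p_vocab:    (V,) probabilities over the fixed vocabulary
    attention:  (m,) attention weights over the table tokens (sums to 1)
    source_ids: (m,) vocabulary id of each table token (assumed in range)
    p_copy:     scalar in [0, 1], probability of copying
    """
    p_vocab = np.asarray(p_vocab, dtype=float)
    copy_dist = np.zeros_like(p_vocab)
    # Scatter-add: repeated table tokens accumulate their attention mass.
    np.add.at(copy_dist, np.asarray(source_ids), np.asarray(attention, dtype=float))
    return (1.0 - p_copy) * p_vocab + p_copy * copy_dist

# Two table tokens that both map to vocabulary id 2.
p = final_distribution([0.5, 0.3, 0.2], [0.6, 0.4], [2, 2], p_copy=0.5)
print(p, p.sum())
```

When both inputs are valid distributions, the result is again a distribution, and a larger $p_{copy}$ shifts probability mass toward words that occur in the table.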
Copy loss. To provide accurate guidance to the pointer-generator network, we employ word transformation methods and auxiliary learning tasks in the model.
We first use word transformation methods to process the target text. When matching the words in the target text with the values in the table, if the word y i in the target text appears in the table, y i is replaced with the field and position information of the value in the table, such as (name,position+,position-), where "position+" and "position-" indicate the positions of the token counted from the beginning and the end of the field, respectively. Words that do not appear in the table are replaced by the word "empty", and the position information is recorded as zero. Table 1 describes the target text transformation results. For example, when the word "war" of the target text appears in the table, we replace the word "war" with (genre, 1, 1).
Additionally, we find that the value corresponding to the "country" attribute often has different expressions as word aliases. If another name for the word in the target text appears in the table, it is processed as above to obtain the corresponding field and position information of the target text. As shown in Table 1, "United States" in the table and "American" in the target text do not match, but they represent the same country, so the field of "United States" is used instead of "American" in the target text. Finally, we concatenate the content embedding $\hat{y}_{dec}$, field embedding $f_{dec}$, and position embedding $(p^{+}_{dec}, p^{-}_{dec})$ of the target text as inputs for the decoder.
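The word transformation described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the table contents, the alias dictionary, and the helper names `build_index` and `transform_target` are hypothetical, only single-word matches are handled, and the first occurrence wins when a word appears in several fields.

```python
def build_index(table):
    """Map each table word to (field, position+, position-), counted from 1.

    `table` is a dict from field name to the list of value tokens.
    """
    index = {}
    for field, tokens in table.items():
        n = len(tokens)
        for i, tok in enumerate(tokens):
            # position+ counts from the start of the field, position- from the end.
            index.setdefault(tok.lower(), (field, i + 1, n - i))
    return index

def transform_target(target_tokens, table, aliases=None):
    """Replace each target word by its field and positions, or ("empty", 0, 0)."""
    index = build_index(table)
    aliases = aliases or {}
    out = []
    for tok in target_tokens:
        key = tok.lower()
        key = aliases.get(key, key)  # map an alias to a word that is in the table
        out.append(index.get(key, ("empty", 0, 0)))
    return out

# Hypothetical table and alias list, for illustration only.
table = {"genre": ["war"], "country": ["united", "states"]}
aliases = {"american": "united"}
result = transform_target(["an", "american", "war", "film"], table, aliases)
print(result)
```

In this toy case "war" maps to ("genre", 1, 1), matching the example in the text, while "an" and "film" map to ("empty", 0, 0).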
At the positions of the matching words, we maximize $p_{copy}$; the copy loss $L_{copy}$ is defined over these positions.
Search strategy. The sentences generated in the decoding phase tend to be repetitive, incoherent, and boring. Even with state-of-the-art language models such as BERT 28 and GPT 29, it is hard to generate high-quality texts. The main reason for this phenomenon is the use of maximization-based search strategies, such as beam search. At each step, the beam search algorithm keeps the top n (the beam width) candidates with the highest probability, repeats the process until a terminator is encountered, and finally outputs the top n sequences with the highest scores. The algorithm usually assigns a higher probability to well-formed texts than to poorly formed texts, but in long texts, high-probability outputs tend to be generic and repetitive.
To address this phenomenon, we use a combination of nucleus sampling (top-p sampling) 12 and the top-k sampling strategy 11 as our search strategy. By truncating the unreliable tails of probability distributions, sampling from tokens containing the vast majority of high-probability words enables the model to avoid the generation of very low-ranked words and allows for dynamic selection.
Top-k-top-p sampling. The top k words $V^{(k)} \subset V$ with the highest probabilities are first selected from the vocabulary $V$ to avoid generating very low-ranked words. Within $V^{(k)}$, the smallest set of words whose cumulative probability exceeds the threshold $p$ is then selected, and the original distribution is rescaled to a new distribution over this set, from which the next word is sampled. The size of the sampling set is thus adjusted dynamically according to the shape of the probability distribution at each time step.
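The two-stage truncation can be sketched with NumPy as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the threshold p is applied to the original probabilities of the top-k words, which is one possible reading of the description (another is to renormalize over the top-k words first).

```python
import numpy as np

def top_k_top_p_filter(probs, k, p):
    """Return a renormalized distribution restricted by top-k, then top-p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs)[:k]        # ids of the k most probable words
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative probability reaches p;
    # if the top-k mass never reaches p, all k words are kept.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def sample_next(probs, k=30, p=0.95, rng=None):
    """Sample the id of the next word from the truncated distribution."""
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=top_k_top_p_filter(probs, k, p)))

dist = top_k_top_p_filter([0.5, 0.3, 0.1, 0.05, 0.05], k=4, p=0.7)
print(dist)
```

In the example, the top-4 words have cumulative probabilities 0.5, 0.8, 0.9 and 0.95, so only the first two are kept and rescaled to 0.625 and 0.375. A flatter distribution would keep more words, which is the dynamic behaviour the text describes.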
Loss function. Our objective function $L$ consists of three parts, the table-to-text constraint loss $L_{CL}$, the table-to-text generation loss $L_{GL}$ and the copy loss $L_{copy}$:
$$L = L_{GL} + \lambda_1 L_{CL} + \lambda_2 L_{copy}$$
where $L_{GL} = -\log P(y \mid T; \theta)$, $P(y \mid T; \theta)$ is defined in Eq. (5), and $\lambda_1$ and $\lambda_2$ are hyper-parameters.
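A minimal sketch of how the three parts could be combined is given below, assuming they are summed with the two hyper-parameters as weights. The copy loss is written here as the negative log of $p_{copy}$ averaged over the target positions that match a table word; that form is an assumption derived from the statement that $p_{copy}$ is maximized at matching positions. The default weights 0.2 and 0.5 are the values reported in the experimental settings.

```python
import numpy as np

def copy_loss(p_copy, is_table_word):
    """Mean negative log p_copy over positions whose word occurs in the table.

    The negative-log form is an assumed reading of "maximize p_copy".
    """
    p_copy = np.asarray(p_copy, dtype=float)
    mask = np.asarray(is_table_word, dtype=bool)
    if not mask.any():
        return 0.0
    return float(-np.log(p_copy[mask]).mean())

def total_loss(l_gl, l_cl, l_copy, lambda1=0.2, lambda2=0.5):
    """Weighted sum of generation, constraint and copy losses."""
    return l_gl + lambda1 * l_cl + lambda2 * l_copy

# Three target positions; the first and third words occur in the table.
l_copy = copy_loss([0.9, 0.2, 0.8], [True, False, True])
print(total_loss(l_gl=2.0, l_cl=0.5, l_copy=l_copy))
```

Positions that do not match a table word are masked out, so the auxiliary term only rewards copying where copying is actually correct.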

Experiment
We use WIKIBIO 4 and ROTOWIRE 9 as benchmark datasets.
Experiment on WIKIBIO.
Dataset and evaluation metrics. WIKIBIO contains 728,321 articles from the English version of Wikipedia. The first sentence of each article in WIKIBIO is extracted as the reference text for the corresponding infobox. Table 2 shows the dataset statistics. There is an average of 26.1 tokens per reference, of which 9.5 tokens appear in the infobox. Each infobox has an average of 53.1 tokens and 19.7 fields. We divide the dataset into training (80%), validation (10%), and testing (10%) sets. The details of the dataset division are listed in Table 3. We use BLEU-4 and ROUGE-4 (F-measure) as automatic evaluation metrics. They are computed by NIST mteval-v13a.pl (BLEU) and MSR rouge-1.5.5 (ROUGE).
We use the Adam optimizer 30 and the GELU activation function 31 to train the model. The learning rate of the Adam optimizer is initially set to 0.001, and we halve it when the model fails to improve on the validation set for 2 epochs. The label smoothing factor is 0.05. We clip the gradients 32 to a maximum norm of 5. At inference, we adopt nucleus sampling with p=0.95 and top-k sampling with k=30. The beam size is set to 5. The maximum length of the generated sentence is limited to 150, based on the lengths of the reference texts. According to the experimental results on the validation set, the weight $\lambda_1$ of the table-to-text constraint loss is 0.2, and the weight $\lambda_2$ of the copy loss is 0.5.
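The learning-rate rule described above can be sketched as a small framework-independent scheduler. The class name `HalveOnPlateau`, the use of a single validation score where higher is better, and the reset of the counter after each halving are assumptions made for illustration; the initial rate of 0.001 and the 2-epoch window are the reported settings.

```python
class HalveOnPlateau:
    """Halve the learning rate after `patience` epochs without validation gain."""

    def __init__(self, lr=0.001, patience=2):
        self.lr = lr
        self.patience = patience
        self.best = float("-inf")
        self.stale = 0  # consecutive epochs without improvement

    def step(self, val_score):
        """Record this epoch's validation score and return the current rate."""
        if val_score > self.best:
            self.best = val_score
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= 0.5
                self.stale = 0
        return self.lr

sched = HalveOnPlateau()
rates = [sched.step(score) for score in [40.0, 41.0, 40.5, 40.8, 41.2]]
print(rates)
```

In the example, the third and fourth epochs fail to beat the best score of 41.0, so the rate is halved once, to 0.0005, at the fourth epoch.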
Baselines. We compare our model with six baseline models. For each of them, we use the same parameter settings as the corresponding paper and report the best experimental results. The baselines are as follows:
Table NLM 4 is based on the conditional language model and introduces a copy mechanism to solve the problem of sparse data. The structured data are parsed both locally and globally, with a focus on the attribute information of words. The same work also created WIKIBIO, the Wikipedia biography dataset.
Order-plan model 5 uses a content-based and link-based hybrid attention mechanism to plan the form of the content. On the decoding side, an RNN with a copy mechanism is used to solve the out-of-vocabulary problem.
Structure-aware Seq2seq model 1 involves field gating and dual attention mechanisms. In the encoding phase, field information is integrated into the table representation through field gating. In the decoding phase, a dual attention mechanism consisting of word-level attention and field-level attention is proposed to effectively model the semantic information between the input tables and the generated descriptions.
FA+RL method 7 uses an attention-based approach to encourage the decoder to focus on uncovered attribute information and avoid missing critical attribute information. It uses reinforcement learning to generate descriptions that are informative and faithful to the table inputs.
NCP model 17 is a two-stage model that combines content selection and content planning. First, a content plan is generated through the pointer generation network. Then, the content plan is employed as the input of the recurrent neural network to generate a description.
NCP+BAT 35 is an end-to-end model that jointly learns the content planning and text generation. The content plan is integrated into the encoder-decoder model by using the coverage mechanism.
Overall experimental results. We carry out experiments on the WIKIBIO dataset, and Table 4 shows the experimental results of the various models. To determine whether the differences between our model and the baselines on the evaluation metrics are statistically significant, we apply a paired t-test, reported in Table 4. Table 4 shows that our model differs from the baseline models at a significance level of 0.01. We further compare the means of our model and the baseline models. The mean BLEU and ROUGE scores of our model are 46.87 and 42.28, respectively, which are higher than the mean values of all baseline models.
Transformer(base) represents a transformer-based data-to-text model without any auxiliary learning task. Compared with the RNN-based Table NLM model, our Transformer(base) model uses the same input and search strategy, but its BLEU and ROUGE values are improved by 9.63 (from 34.70 to 44.33) and 14.32 (from 25.80 to 40.12), respectively. Thus, with sufficient data, the quality of text generation can be significantly improved by applying the transformer-based model. The NCP model, NCP+BAT model and FA+RL model improve performance by allowing the decoder to focus on key attribute information. Compared with these baseline models, our model performs better on both ROUGE and BLEU. We apply the table-to-text constraint loss (row 9) to enhance the representation of the table content, which brings the semantics of the table content and the target text closer. The BLEU and ROUGE values of the model increase by 1.6 and 1.24, respectively. The experimental results support our hypothesis that appropriate learning objectives can enhance the performance of the model. We adopt target text preprocessing and copy loss in the model (row 10) to faithfully copy the contents of the table; Table 4 reports the resulting BLEU and ROUGE values. Experimental results show that $L_{CL}$ is more beneficial to text generation than $L_{copy}$. In row 11 of Table 4, we use a combination of top-p and top-k sampling instead of beam search, which improves the BLEU score from 44.33 to 44.79 and the ROUGE score from 40.12 to 40.57. The new sampling policy alleviates problems such as redundancy and inconsistency and improves the quality of the generated text.
We visualize the process of generating a paragraph description based on an infobox in Fig. 2, where the horizontal axis represents the values in the table and the vertical axis represents the generated text. The table word with the largest attention weight is selected as the word generated at the current step. For example, when generating the third token, the word "general" in the table receives the largest attention weight, so "general" is used as the generated word. Most of the attention weights in Fig. 2 yield the desired results, further confirming the effectiveness of our model.
Table-to-text constraint loss effectiveness analysis. We study how the table-to-text constraint loss ($L_{CL}$) affects the similarity of the source and target sentences. We adopt the cosine similarity 37 to calculate the similarity between the source and target sentences, where each sentence is represented by the mean of its word embeddings; the similarity is defined in Eq. (14). From the second to fourth columns of Table 5, it can be seen that adding the table-to-text constraint loss raises both the generation performance (BLEU and ROUGE) and the sentence similarity (Sim) above those of the transformer (base). This suggests a correlation between text generation performance and sentence similarity: the more similar the source and target sentences, the better the generation performance. The experimental results indicate that improving the similarity between the table and the target text is an effective way to improve the performance of the model.
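The similarity measure used in this analysis can be sketched as follows: each sentence is represented by the mean of its word embeddings, and the two mean vectors are compared with the cosine. The function name and the toy embeddings are illustrative.

```python
import numpy as np

def sentence_similarity(src_embeddings, tgt_embeddings):
    """Cosine similarity between the mean word embeddings of two sentences."""
    u = np.asarray(src_embeddings, dtype=float).mean(axis=0)
    v = np.asarray(tgt_embeddings, dtype=float).mean(axis=0)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings of dimension 2: the mean vectors are [1, 1] and [2, 2].
src = [[1.0, 0.0], [1.0, 2.0]]
tgt = [[2.0, 2.0]]
print(sentence_similarity(src, tgt))  # close to 1.0, the means are parallel
```

The cosine ignores vector length, so the measure reflects only the direction of the averaged embeddings and is not affected by the number of words in either sentence.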
We further analyze the efficiency of the table-to-text constraint loss in terms of the speed and performance of the model. Compared with the transformer (base), $L_{CL}$ achieves superior generation performance without any increase in parameters. The BLEU and ROUGE values are increased by 1.6 and 1.24 points, respectively, while the training speed is reduced by only about 1%. This shows that the table-to-text constraint loss can improve the quality of text generation with almost no loss of training speed.
Case study. Four of the generated texts are randomly selected for comparison with the reference texts, and the results are shown in Table 6. More table-to-text generation examples are listed in Appendix A. "Reference" indicates the reference text, and "Generation" indicates the generated text. As seen in Table 6, there is redundancy in the first and third generated texts. Although the second generated text is not consistent with the reference text, it is more faithful to the table's contents. In addition, our model can learn the relationship between "north carolina" and "American" without external knowledge. The last generated text does not cover the full reference text, but the missing parts do not appear in the table. There are slight differences between the generated texts and the reference texts, but most of the generated text exactly replicates the content of the table, which is primarily due to our copy loss.
From the above analysis, it is clear that our model more reliably describes the table contents than other models, although there is a small amount of redundancy. Therefore, it is worth exploring whether the generated text should be closer to the reference text or more faithful to the table input.

Experiment on ROTOWIRE.
Experiments on the WIKIBIO dataset demonstrate the effectiveness of the model. In this part, we perform experiments on the ROTOWIRE dataset to test the generality of the model. Compared with the WIKIBIO dataset, the ROTOWIRE dataset is largely numerical. Therefore, the model is required to understand the relationships between numerical data.
$$\mathrm{sim} = \cos(\hat{E}_{enc}, \hat{E}_{dec}) \quad (14)$$
Table 4. BLEU and ROUGE scores on the WIKIBIO dataset. For each model, we report the mean ± standard deviation. Compared with our model, * p < 0.05, ** p < 0.01.
Dataset and evaluation metrics. The ROTOWIRE dataset consists of (human-written) NBA basketball game summaries with their corresponding box-scores and line-scores. In the line-score tables, each team is described by 15 types of values. In the box-score tables, each row corresponds to a player in the game, and each player has 23 different types of values. The average length of the summaries is 337.1 tokens, and the vocabulary size is 11.3K. The summaries have been randomly split into training, validation, and test sets consisting of 3398, 727, and 728 summaries, respectively; the details are shown in Table 7. We use BLEU-4 and several content-oriented metrics 9 to evaluate model output. For the content-oriented metrics, we apply the public IE system 17 to extract relations. Content-oriented metrics include three aspects:

• Content Selection (CS) evaluates the precision and recall of the relations extracted from the generated description against those extracted from the gold description.
Implementation details. We use a transformer model in which the number of blocks is set to 6, the number of heads is 8 and the number of hidden units is 512. In the data preprocessing stage, the input table is converted into a fixed-length sequence of records. Each record consists of four types of information (entity, type, value and game information), and the record embedding size is 128. Since the records have no order relationship, we only learn the position embedding of the decoder in the transformer. Our model is trained with the GELU activation function 31 and the Adam optimizer 30. The learning rate of the Adam optimizer is fixed to 0.0001. The label smoothing factor is 0.05. At inference, we adopt nucleus sampling with p=0.95 and top-k sampling with k=40. The maximum length of the generated sentence is limited to 600, based on the lengths of the reference texts. According to the experimental results on the validation set, the weight $\lambda_1$ of the table-to-text constraint loss is 0.2, and the weight $\lambda_2$ of the copy loss is 0.5.
Experimental results. On the ROTOWIRE dataset, we use five baseline models. For each of them, we adopt the best experimental results reported in the corresponding paper. GOLD represents the experimental results on the gold summary. The baseline models are as follows:
CC 9 adopts a conditional copy mechanism in the encoder-decoder model.
Template is the template-based generator used in 9, which generates 8 templated sentences from the training set: a sentence about the teams playing in the game, 6 sentences about the highest-scoring players and a concluding sentence.
The NCP model 17 combines content selection and content planning in a neural network architecture.
The RCT model 33 considers the row, column, and time dimension information in the input table, and then combines the three-dimensional representations into a dense vector through the table cell fusion gate.
The Hierarch-k model 36 employs a novel two-level Transformer encoder to hierarchically capture the structure of the data. Two variants of the hierarchical attention mechanism are used to obtain the context that serves as the input of the decoder.
Table 8 displays the automatic evaluation results on the validation set of the ROTOWIRE dataset. Our model achieves significantly higher results than all baseline models in BLEU, CS precision and the CO metric. Our model generates almost the same number of records as the baseline model CC, but shows a significant improvement in the other metrics. Compared with the CC model, our model is 12.58% higher on CS precision, 12.99% higher on CS recall, 11.42% higher on RG precision, 8.39% higher on the CO metric, and 4.95 points higher on BLEU. Compared with NCP and RCT, our model is better on CS precision, the Content Ordering metric and BLEU. This may be due to the fact that our model generates almost the same number of relations as the gold summary, reducing the normalized DL distance 34 between the two sequences of relations. However, our model has lower RG precision and lower CS recall as the number of relations decreases. Experimental results on the test set are shown in Table 9. As can be seen from Table 8 and Table 9, the results on the test set and the validation set are not significantly different. Compared with all baseline models, our model obtains higher CS precision, CO metric and BLEU. Our model yields a higher BLEU value than the best baseline, Hierarch-k (19.43 vs. 17.50). This shows that the text generated by our model is closer to the gold summary and more fluent.
Table 6. Results of the comparison between the reference text and generated text.
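To make the role of the normalized DL distance concrete, the sketch below computes a content-ordering score between two sequences of extracted relations. It is an illustration under stated assumptions: it uses the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, it reports one minus the normalized distance so that higher is better, and the relation tuples in the example are made up. The evaluation scripts used in the experiments may differ in these details.

```python
def osa_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[len(a)][len(b)]

def content_ordering(generated, gold):
    """One minus the length-normalized distance (assumed convention)."""
    longest = max(len(generated), len(gold))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(generated, gold) / longest

# Hypothetical relation sequences (entity, type, value).
gold = [("Lakers", "PTS", 102), ("Bryant", "PTS", 30), ("Bryant", "REB", 5)]
generated = [("Bryant", "PTS", 30), ("Lakers", "PTS", 102), ("Bryant", "REB", 5)]
print(content_ordering(generated, gold))
```

Swapping the first two relations costs a single transposition, so the score is 1 - 1/3. When the generated and gold sequences differ greatly in length, the insertions or deletions needed to align them lower the score, which is consistent with the explanation given above.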
Ablation studies. Next, we conduct ablation studies to evaluate the components of our model. These are:
• The table-text constraint loss, which constrains the complex structure of the table by the target text.
• The copy loss, which provides accurate guidance to the pointer-generator network.
• The search strategy, which reduces the probability of problems such as repetitive and dull sentences.
Removing the table-text constraint loss. In this configuration, we employ the same search strategy and copy loss as our model, but the model is trained without the table-text constraint loss. Table 10 (-TT_CL) shows that almost the same number of records is extracted as with our full model, but the accuracy decreases by 6.93%. CS precision and CS recall drop by 4.25% and 4.34%, respectively.