Tüpa at SemEval-2019 Task 1: (Almost) feature-free Semantic Parsing

Our submission for Task 1 ‘Cross-lingual Semantic Parsing with UCCA’ at SemEval-2019 is a feed-forward neural network that builds upon an existing state-of-the-art transition-based directed acyclic graph parser. We replace most of its features with deep contextualized word embeddings and introduce an approximation that represents non-terminal nodes in the graph as an aggregation of their terminal children. We further demonstrate how augmenting the data using the baseline systems provides a consistent advantage in all open submission tracks. We submitted results to all open tracks (English in- and out-of-domain; German in-domain; and French in-domain, low-resource). Our system achieves competitive performance in all settings except French, where we did not augment the data. Post-evaluation experiments showed that data augmentation is especially crucial in this setting.


Introduction
Semantic Parsing is the task of assigning an utterance a structured representation of its meaning. The goal is to assign similar structures to utterances with similar meanings, regardless of their syntactic realizations. In Syntactic Parsing, for instance, the sentence 'John saw Paul.' will have a different structure than 'Paul was seen by John.'. Semantic Parsing, in contrast, aims to solely encode the fact that John saw Paul. Deriving a semantic representation of an utterance has various applications. It can serve as a starting point for the evaluation of machine translation systems, as the structure of the semantic representation should be similar across languages. Birch et al. (2016) use human-annotated scores of individual UCCA semantic units in their HUME metric to provide a fine-grained analysis of translation quality and improve scalability to longer sentences by approximating human judgement semi-automatically from the annotated scores of each unit. Explicit semantic representations could also provide the structured information necessary to alleviate recent issues in Natural Language Inference (NLI), where McCoy and Linzen (2019) showed that state-of-the-art NLI systems fail to recognize that, e.g., 'Alice believes Mary is lying.' does not entail 'Alice believes Mary.'. Using precise semantic representations of the sentences, a theory could be built on which various logical inferences can be performed with a theorem prover, as in Martínez-Gómez et al. (2016).
Universal Conceptual Cognitive Annotation (UCCA) (Abend and Rappoport, 2013) is a semantic grammar formalism in which natural language expressions are analyzed as deep directed acyclic graph (DAG) structures, where 'deep' means that the graphs feature non-terminal nodes. Due to its coarse-grained representation using cognitively motivated categories, it is both domain- and language-independent and is quickly learned even by annotators without a linguistic background (Abend and Rappoport, 2013).
The goal of the SemEval-2019 Task 1 'Cross-lingual Semantic Parsing with UCCA' was to develop a parser producing UCCA DAG structures, trained on articles from Wikipedia in English and passages from the book "Twenty Thousand Leagues Under the Sea" in French and German. The parsers were evaluated with the DAG-F1 metric on in-domain passages in English, French and German, as well as on out-of-domain passages in English, in both an open and a closed track (Hershcovich et al., 2018b). Since we made extensive use of external resources, we participated only in the open track of all settings.
For our participation, we build upon the transition-based DAG parser TUPA (Hershcovich et al., 2017). Our adaptation reuses its transition system and oracle. We extend TUPA with respect to its representation of non-terminal nodes, so that each non-terminal is represented as an aggregation of all its terminal children. While TUPA uses a Recurrent Neural Network, our system is a simple feed-forward network that uses a small set of features and ELMo contextualized embeddings (Peters et al., 2018).

Background
Until recently, semantic parsers were exclusively symbolic rule-based systems (Bos, 2005). These systems rely on complex, hand-written and necessarily language-specific sets of rules, requiring a re-implementation for every new language. More recently, neural methods have also arrived in the domain of Semantic Parsing. They achieve state-of-the-art results while being largely language-agnostic. Since these systems usually require large amounts of annotated data, this line of work is largely concerned with the augmentation of training data. Hershcovich et al. (2018a) recognize the similarity between several annotation schemes and jointly learn to parse other semantic formalisms in a multi-task setting, while van Noord et al. (2018) add large amounts of automatically annotated data to their training data. Both approaches led to significant improvements over not using the additional data.

Silver Data
We created additional training data for both English and German using the open track baseline systems. The English silver data was taken from the 1B word benchmark (Chelba et al., 2014), the German from the archive of the newspaper taz. For both languages, we took the first 15,000 sentences of the corpora and added UCCA annotation using the baseline systems. Our training datasets then consisted of the concatenation of gold and silver data; we additionally kept a gold-only set. Due to a lack of time, we did not create silver data for our French submission. Post-evaluation results for French, trained on v2.0 of the GSD treebank provided by Universal Dependencies (Nivre et al., 2016), are presented in Section 6.1.
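The silver-data construction described above can be sketched as follows. This is a hypothetical illustration, not our actual tooling: `baseline_parse` stands in for the open track baseline system, and the inputs stand in for the raw corpora.

```python
# Hypothetical sketch of the silver-data pipeline: parse the first 15,000
# raw sentences with the baseline system, then concatenate the resulting
# silver graphs with the gold training set.

def build_silver_corpus(raw_sentences, baseline_parse, n=15_000):
    """Annotate the first `n` raw sentences with the baseline parser."""
    return [baseline_parse(sentence) for sentence in raw_sentences[:n]]

def build_training_set(gold_graphs, silver_graphs):
    """The combined training set is simply gold followed by silver data."""
    return list(gold_graphs) + list(silver_graphs)
```

The gold-only setting then corresponds to calling `build_training_set` with an empty silver list.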

System
Our system is an ensemble of small feed-forward neural networks. We use three global features: typed absolute counts of previous parser actions as well as action- and node-ratios (Hershcovich et al., 2017). We further follow the standard in transition-based parsing and extract a set of features based on the top three items on the stack and buffer. To capture some of the structure of the partially built graph, we extract the rightmost and leftmost parents and children of the respective items, following Hershcovich et al. (2017). Each of these items is represented by the ELMo embedding of its form, the embedding of its dependency head and the embeddings of all its terminal children. We use the average over all ELMo layers to retrieve the embedding of a word. Non-terminal nodes do not have a form or dependency head, hence these are represented by a learned non-terminal embedding. Both the non-terminals and terminals have a third feature, a representation of their children.
In the case of terminals, this feature is equal to their form feature. For non-terminals, it is an aggregation of all their children, produced by the child representation module. Figure 1 illustrates the set of features used by our system. We experimented with richer feature sets, including the last parser actions, named-entity, part-of-speech and dependency types, but dropped them after performing preliminary experiments. The input to the feed-forward module is the concatenation of all features with the output of the child representation module. The classification portion of the system was implemented using Tensorflow (Abadi et al., 2015).

Representing Non-Terminals
The child representation module aims to enrich the representation of non-terminal nodes. Our initial representation for non-terminal nodes was a set of discrete features describing the number of typed in- and outgoing edges and the node's height in the tree. While this might be informative on an abstract level, it does not provide any information about the content covered by the node. We address this lack of information by concatenating the embedding of each terminal child of a node with an embedding of the first edge type leading to it. Each resulting combination is fed through a dense layer with d neurons, resulting in n vectors with d dimensions, where n is the number of terminals under the node. We then reduce the n vectors into a single d-dimensional vector by taking the maximum value of each dimension.
Figure 2 depicts how the representation of a non-terminal node is obtained. While it would be desirable to process the children using context-aware methods such as RNNs or self-attention, this is not feasible since some of the nodes can have more than 100 children. Future work should explore recursive formulations that represent a node by its direct children instead of relying on all terminal children, which performs largely redundant operations for higher nodes.
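The aggregation above can be sketched in a few lines of numpy. The weights are random stand-ins for the learned dense layer, and the ReLU nonlinearity is an assumption made for illustration; the paper only specifies a dense layer with d neurons followed by an element-wise maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, EDGE, D = 8, 4, 6          # illustrative dimensions

W = rng.normal(size=(EMB + EDGE, D))  # stand-in for learned dense weights
b = np.zeros(D)

def child_representation(child_embs, edge_embs):
    """child_embs: (n, EMB) terminal embeddings; edge_embs: (n, EDGE)
    embeddings of the first edge type leading to each child.
    Returns a single (D,) vector for the non-terminal node."""
    x = np.concatenate([child_embs, edge_embs], axis=1)  # (n, EMB + EDGE)
    h = np.maximum(x @ W + b, 0.0)                       # dense layer (+ReLU, assumed)
    return h.max(axis=0)                                 # element-wise max over the n children
```

The max-pooling step is what makes the module indifferent to the number (and order) of children, which matters since nodes can dominate over 100 terminals.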

Hyperparameters
We apply dropout (Srivastava et al., 2014) and settled on our hyperparameters after initial experiments. All results were produced using a five-model ensemble, consisting of the model with the best transition accuracy and the four checkpoints following it before early stopping.

The results show that our parser achieves performance competitive with the baseline while relying on fewer features. In particular, for the English in-domain data we achieve the same performance as the baseline, and for the out-of-domain data we surpass it by 0.025 DAG-F1. In German and French, where only in-domain data exists, our approach is outperformed by the baseline, which we partially attribute to issues in the creation of the silver data. Post-submission results, obtained after performing a more exhaustive hyperparameter search on the development set and with correct silver data, surpass the baseline performance on the test sets in all open settings.
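The checkpoint-selection rule for the five-model ensemble can be sketched as follows; checkpoints are modeled as hypothetical (step, dev-accuracy) pairs, which is an illustrative simplification of our training loop.

```python
# Sketch of ensemble selection: pick the checkpoint with the best transition
# accuracy on the development set plus the four checkpoints evaluated
# immediately after it (before early stopping).

def select_ensemble(checkpoints, k=5):
    """checkpoints: list of (step, dev_accuracy) pairs in training order."""
    best_idx = max(range(len(checkpoints)), key=lambda i: checkpoints[i][1])
    return checkpoints[best_idx:best_idx + k]
```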

Further Experiments
In this section, we describe the findings of our post-evaluation experiments. We evaluated the effect of silver data and provide results for French with silver data. We further performed experiments on non-terminal representations and investigated the effect of model size. Since this only covers a fraction of our experiments and describing them all would be out of scope, we provide the full results alongside their hyperparameters at https://twuebi.github.io/publications/ucca_post_eval.pdf.

Silver Data
We measured the effect of silver data on English and French by evaluating several model configurations in two settings. The first setting matches the training data used for the submission and is the concatenation of the gold and silver data. In the second setting, the only available data is the gold data.

English: We trained three models for English. The first model configuration uses all features and corresponds to the model described at the end of Section 4; the second is our submission, described in Section 4. The last model uses only embeddings of the forms and dependency heads. As shown in Table 2, additional training data provides a consistent boost in F1 score across all tested feature combinations. Moreover, the effect of the silver data seems to be larger for models with more features, indicating a better estimation of the feature representations based on the additional data.
Low-resource setting: Table 3 demonstrates the effect of silver data on French for the submission model configuration. The effect of additional data is largest in the low-resource setting, providing a boost of 0.1 in average F1 score. Adding the silver data also leads to some of the remote edges being predicted correctly, whereas there are no correct remote edges for the gold-only model.

Non-Terminal Representation
To measure the effectiveness of our non-terminal representation, we ran two experiments using silver and gold data. In both cases, we trained one model with aggregated non-terminal representations and one with the discrete representations of typed in- and outgoing edges and the nodes' heights in the tree. The first experiment used all available features. The second was trained with the features of our submission. Table 4 presents the results of the experiments. The explicit child representations provide a clear improvement over the discrete representation. In the second experiment, where no in- and outgoing edges were used and the only non-terminal representations are the left- and rightmost children, the gap increased even further; in fact, this configuration yields the worst F1 score of all models trained on silver data.

Figure 3: DAG F1 scores on English development data on the y-axis; millions of parameters in the models on the x-axis. Larger models seem to provide some improvements that begin to level off for big models.

Bigger means better?
Figure 3 contrasts the number of trainable parameters of the models in our experiments with the F1 score on the English development set. While there are some improvements for larger models, the effect begins to level off at 200M parameters and eventually leads to a small regression with the largest model. Possible causes are overfitting and a lack of training data. Future work should explore whether additional training data allows for larger models. Additional regularization, such as an L2 penalty, might also prove useful; for our experiments this was out of scope, since training so many models was not feasible.

Conclusion
In this work, we presented a parser for the semantic grammar formalism UCCA. Our parser relies on a small set of features and achieves competitive performance on the English and German data, but lags behind on French, where almost no training data is available. We demonstrated, using ablation experiments, that the explicit representation of non-terminals and additional silver data are crucial for our results. We have further shown that silver data is especially helpful in the low-resource setting, where it boosts the average F1 score from 0.456 to 0.557. Future work should investigate how much improvement additional data can provide, both in the form of other formalisms (Hershcovich et al., 2018a) and of silver data (van Noord et al., 2018). Besides the data aspect, we also believe that improving the non-terminal representation will lead to significant gains. The goal should be to find a representation that leverages the recursive structure of the partially built graph.

Figure 2: Depiction of a non-terminal representation. The terminal children dominated by the grey node are concatenated with the embedding of the first edge leading to them and fed through a fully-connected layer. The multiple resulting vectors are reduced into a single one by taking the maximum value of each dimension.
Figure 1: Illustration of the features used by Tüpa. The final feature vector results from the concatenation of all stack and buffer features with the global features. Features dropped after preliminary experiments are omitted for brevity.

Table 1 :
DAG-F1, primary F1 and remote F1 scores with the DAG-F1 score of the baseline on the test sets in the open tracks.
We use mini-batches of size 192 and evaluate on the development set every 1000 mini-batches. As training time imposes a serious limitation, we did not perform an extensive hyperparameter search.

Table 2 :
DAG F1 scores on the English development set after training with gold and gold+silver data. Silver data provides a boost for all combinations.

Table 3 :
DAG F1 scores on the French test set with and without silver data. In the low-resource setting, the effect of additional data is largest. Without silver data, the parser did not predict any remote edges correctly.

Table 4 :
Effect of discrete and aggregated non-terminal representations on the DAG F1 score on the English development set. The aggregated representation provides a clear advantage over the discrete one.