Variants of Vector Space Reductions for Predicting the Compositionality of English Noun Compounds

Predicting the degree of compositionality of noun compounds such as “snowball” and “butterfly” is a crucial ingredient for lexicography and Natural Language Processing applications, to know whether the compound should be treated as a whole, or through its constituents, and what it means. Computational approaches for an automatic prediction typically represent and compare compounds and their constituents within a vector space, and use distributional similarity as a proxy to predict the semantic relatedness between the compounds and their constituents as the compound’s degree of compositionality. This paper provides a systematic evaluation of vector-space reduction variants across kinds, exploring part-of-speech-based reductions both on their own and in combination with Principal Components Analysis using Singular Value Decomposition, as well as word2vec embeddings. We show that word2vec and nouns-only dimensionality reductions are the most successful and stable vector space variants for our task.


Introduction
Predicting the degree of compositionality of noun compounds (and multi-word expressions more generally) is a crucial ingredient for lexicography and Natural Language Processing (NLP) applications, to know whether the expression should be treated as a whole, or through its constituents, and what the expression means. Compare, for example, the English noun compounds snowball, a ball consisting of snow, where clearly both constituents snow and ball contribute to the meaning of the compound, and butterfly, where the semantic contribution of the modifier noun butter is not obvious without knowing about the etymology of the compound. Studies such as Cholakov and Kordoni (2014), Weller et al. (2014), Cap et al. (2015), and Salehi et al. (2015b) are examples of NLP applications that have integrated the prediction of multi-word compositionality into statistical machine translation. Accordingly, the field has witnessed a rich amount of computational approaches to automatically predict the degree of compositionality of noun compounds. These approaches typically represent compounds and their constituents within a vector space, and then compare the compound vectors with the constituent vectors as a proxy to the compounds' degree of compositionality (Reddy et al., 2011b; Reddy et al., 2011a; Salehi and Cook, 2013; Schulte im Walde et al., 2013; Salehi et al., 2014; Schulte im Walde et al., 2016; Cordeiro et al., 2019). Most of the approaches focus on English and German; most recently, Cordeiro et al. (2019) applied their framework also to French and Portuguese. All of the above-mentioned approaches explored variants of vector space models in some way, regarding the composite functions to combine the constituent vectors (Reddy et al., 2011b); the translations of compounds and constituents into multiple languages (Salehi et al., 2014); the contributions of modifiers and heads (Schulte im Walde et al., 2016); etc.
What is still lacking, however, is a systematic assessment of the effect of vector-space reductions on the quality of predicting compositionality: Bullinaria and Levy (2012) explored the effect of Singular Value Decomposition (SVD) on semantics in vector spaces in general; and from Baroni et al. (2014b) and Levy et al. (2015), among many others, we know that word embeddings provide a useful low-dimensional representation for vector spaces. But to our knowledge, to date only Salehi et al. (2015a) and Cordeiro et al. (2019) have integrated vector-space reductions (in the form of word embeddings) into their computational prediction of noun compound compositionality, and Schulte im Walde et al. (2013) explored part-of-speech-based reductions in combination with frequency effects. Our contribution in this paper is to provide a systematic evaluation of vector-space reductions across kinds, i.e., exploring part-of-speech-based reduction, Principal Components Analysis using Singular Value Decomposition, and word2vec embeddings. Relying on the English noun compound dataset by Reddy et al. (2011b) as our gold standard, we show that word2vec and nouns-only dimensionality reductions are the most successful and stable vector space variants for our task.

Related Work
The most closely related studies include distributional approaches that predict the degree of compositionality of a compound with respect to a specific constituent (by comparing the compound vector to the respective constituent vector), or to a functional combination of several constituents' vectors. Most importantly, Reddy et al. (2011b) used a standard distributional model to predict the compositionality of compound-constituent pairs for 90 English compounds. They extended their predictions by applying composite functions (see above). In a similar vein, Schulte im Walde et al. (2013) predicted the compositionality for 244 German compounds, and Schulte im Walde et al. (2016) investigated their models on further datasets while taking compound and constituent properties into account. Salehi et al. (2014) defined a cross-lingual distributional model that used translations into multiple languages and distributional similarities in the respective languages to predict the compositionality for the two datasets from Reddy et al. (2011b) and Schulte im Walde et al. (2013). Cordeiro et al. (2019) provide the most recent investigation in a cross-linguistic study on the effects of corpus, modelling and composite parameters for English, French and Portuguese.

Gold Standard of Noun Compounds
Our focus of interest is on English noun compounds, such as butterfly, snowball and teaspoon as well as car park, zebra crossing and couch potato, where the grammatical head (in English, this is typically the rightmost constituent) is a noun. We are interested in the degrees of compositionality of noun compounds, i.e., the semantic relatedness between the meaning of a compound (e.g., snowball) and the meanings of its constituents (e.g., snow and ball). As gold standard we used the dataset of English noun compounds created by Reddy et al. (2011b). Assuming that compounds whose constituents appear either as their hypernyms or in their definitions tend to be compositional, Reddy et al. induced a candidate compound set with various degrees of compound-constituent relatedness from WordNet (Miller et al., 1990; Fellbaum, 1998) and Wiktionary. A random choice of 90 compounds that appeared with a corpus frequency > 50 in the ukWaC corpus (Baroni et al., 2009) constituted their gold-standard dataset, which was annotated with compositionality ratings on the semantic contribution of the modifier to the compound meaning (Word1), the semantic contribution of the head noun to the compound meaning (Word2), and the compositionality of the compound as a whole (Phrase).

Corpus and Co-Occurrence Vector Space
As corpus data for our vector-space variants we used one of the currently largest web corpora for English: ENCOW16, containing ≈9.6 billion words (Schäfer and Bildhauer, 2012; Schäfer, 2015). We applied the TreeTagger for part-of-speech (pos) tagging and lemmatisation (Schmid, 1994), and we created frequency lists for all corpus lemmas and lemma-pos combinations. As basis for our vector-space variants, we created a co-occurrence matrix for the gold-standard compounds and their constituents using a standard 10-word window (left+right) across the lemmatised ENCOW16. The window was applied within sentences because the corpus is sentence-shuffled, such that going beyond sentence borders is not meaningful. Since our target compounds are open compounds (with spaces), we pre-processed the corpus by joining all space-separated instances of the compounds into single tokens before running the window counts. The resulting target-context matrix contains 90 compound and 168 constituent targets (i.e., a total of 258 targets) as rows and 64,508 context dimensions across parts-of-speech as columns.
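The within-sentence window counting described above can be sketched as follows (a minimal sketch; the token-joining convention "snow_ball" and the toy sentences are our own illustrative assumptions, not data from ENCOW16):

```python
from collections import defaultdict

def cooccurrence_counts(sentences, targets, window=10):
    """Count within-sentence co-occurrences in a +/-`window` lemma window.

    `sentences` is an iterable of lemma lists; `targets` is the set of
    compounds/constituents to build vectors for. Open compounds are
    assumed to have been joined into single tokens beforehand
    (e.g. "snow ball" -> "snow_ball").
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for i, lemma in enumerate(sent):
            if lemma not in targets:
                continue
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:                       # skip the target itself
                    counts[lemma][sent[j]] += 1
    return counts

# Toy lemmatised, sentence-shuffled input.
sents = [["snow_ball", "roll", "down", "snow", "hill"],
         ["child", "throw", "snow_ball", "at", "friend"]]
counts = cooccurrence_counts(sents, {"snow_ball"})
```

Because counting never crosses a sentence boundary, the sentence-shuffling of the corpus does not distort the resulting context vectors.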

Vector-Space Variants
Based on the general co-occurrence matrix described in the previous Section 3.2., we systematically created vector-space reductions across kinds, i.e., exploring part-of-speech-based reduction both on its own and in combination with Principal Components Analysis (using Singular Value Decomposition) and word2vec embeddings. In the following, we describe our variants; Table 2 lists the variants accompanied by their dimensionality.

• ALL
As baseline we used the whole co-occurrence matrix.

• POS
We used subsets of the co-occurrence matrix with only context dimensions of specific parts-of-speech (nouns vs. verbs), and from specific frequency ranges, as previously done in a similar way by Schulte im Walde et al. (2013). Since nouns were generally more useful than verbs (see results below), we performed further fine-tuning on the noun matrix only, using just the 1,000/5,000/10,000/. . . /40,000 most frequent nouns from the corpus as context dimensions.

• PCA
We performed Principal Components Analysis (PCA) using Singular Value Decomposition (SVD) to reduce the dimensionality of the whole matrix and the matrices containing only noun dimensions.
With this PCA-using-SVD method, our matrix M was first decomposed into three matrices: M = U ΣW^T (i.e., performing Singular Value Decomposition). Then, to reduce the number of dimensions to k, we sliced U to its first k columns, Σ to the top-left k × k matrix, and W^T to its first k rows. Multiplying the truncated U and Σ then yields a representation of the targets with k dimensions, fewer than in M.
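The truncation step can be sketched with NumPy's SVD (a minimal sketch: PCA proper would mean-center the columns first, which we omit here to stay close to the description above, and the random matrix is a toy stand-in for the real target-context matrix):

```python
import numpy as np

def pca_via_svd(M, k):
    """Reduce the target-context matrix M to k dimensions via truncated SVD.

    M is decomposed as M = U @ diag(s) @ Wt; keeping the first k columns
    of U and the top-k singular values yields a k-dimensional
    representation of the row targets (U_k @ Sigma_k).
    """
    U, s, Wt = np.linalg.svd(M, full_matrices=False)
    # Broadcasting s[:k] over the columns is equivalent to U_k @ diag(s_k).
    return U[:, :k] * s[:k]                     # shape: (n_targets, k)

# Toy stand-in with the paper's 258 targets and 1,000 context dimensions.
M = np.random.RandomState(0).rand(258, 1000)
reduced = pca_via_svd(M, 100)
```

Because the singular values are sorted in decreasing order, the first k components retain the directions of largest variance in the co-occurrence space.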

• WORD2VEC
We trained a standard word2vec two-layer neural network model (Mikolov et al., 2013) on the ENCOW16 corpus with window size 10 to obtain 300-dimensional word vectors for our compounds and constituents.

Prediction Functions
Relying on the vector-space variants, we used the cosine to determine the distributional similarity between the compounds and their constituents, which in turn predicted the semantic relatedness between the compounds and their constituents: we assume that the stronger the distributional similarity (i.e., the higher the cosine value), the stronger the semantic relatedness and therefore the degree of compositionality.
Next to assessing the individual contributions of compound-modifier and compound-head relatedness, we applied the same functions as in Reddy et al. (2011b) to combine the compound-constituent cosine scores for predicting the degree of compositionality of the compounds, as also done in more general terms for in-depth investigations of phrase composite functions (Mitchell and Lapata, 2010; Coecke et al., 2011; Baroni et al., 2014a; Hermann, 2014):

WORD1: use only the compound-modifier cosine score
WORD2: use only the compound-head cosine score
ADD: add the compound-modifier and compound-head cosine scores
MULT: multiply the compound-modifier and compound-head cosine scores
COMB: add the compound-modifier, the compound-head and the multiplication of both cosine scores

Given that each component within the functions might provide a different weight to the overall prediction, we used a linear regression model to fit each function and find the corresponding coefficients. We report the best result after 3-fold cross-validation against the human judgements. The vector space predictions were evaluated against the mean human ratings on the degree of compositionality, using the Spearman Rank-Order Correlation Coefficient ρ (Siegel and Castellan, 1988).
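The prediction functions and the Spearman evaluation can be sketched as follows (the cosine scores and gold ratings are invented toy values, not results from the paper; the linear-regression weighting and cross-validation are omitted for brevity):

```python
import numpy as np
from scipy.stats import spearmanr

def predict(cos_mod, cos_head, function):
    """Combine compound-modifier and compound-head cosine scores
    according to the prediction functions of Reddy et al. (2011b)."""
    combiners = {
        "WORD1": cos_mod,
        "WORD2": cos_head,
        "ADD":   cos_mod + cos_head,
        "MULT":  cos_mod * cos_head,
        "COMB":  cos_mod + cos_head + cos_mod * cos_head,
    }
    return combiners[function]

# Toy cosine scores for three compounds and toy mean human ratings.
cos_mod  = np.array([0.62, 0.10, 0.45])
cos_head = np.array([0.70, 0.15, 0.50])
gold     = np.array([4.8, 1.2, 3.5])

# Evaluate one prediction function against the gold ratings.
rho, _ = spearmanr(predict(cos_mod, cos_head, "ADD"), gold)
```

Since Spearman's ρ compares ranks rather than raw values, any monotonic rescaling of the cosine scores leaves the evaluation unchanged.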

Taking Compound and Constituent Properties into Account
In order to zoom into specific strengths of individual vector space variants, we apply the variants to subsets of our compound targets according to the targets'

• degree of compositionality,
• compound frequency,
• modifier productivity, and
• head productivity.
For each of these conditions, we created three disjunctive subsets of the 90 compound targets with 30 targets each. The subsets contain the strongest, weakest and in-between targets as based on the respective condition, e.g., regarding the compound frequency condition we distinguish between high-frequency, mid-frequency and low-frequency compounds. The empirical information relies on a refinement of the Reddy et al. dataset by Schulte im Walde et al.
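The three-way split by condition can be sketched as follows (the helper name and the toy frequency values are illustrative assumptions):

```python
def split_into_ranges(targets, scores, n_bins=3):
    """Partition targets into equal-sized low/mid/high subsets by a property.

    `scores` maps each target to the conditioning value, e.g. corpus
    frequency, modifier productivity, or degree of compositionality.
    """
    ranked = sorted(targets, key=lambda t: scores[t])
    size = len(ranked) // n_bins
    return [ranked[i * size:(i + 1) * size] for i in range(n_bins)]

# Toy frequencies for 90 compounds (the real values come from the corpus).
freqs = {f"c{i}": i for i in range(90)}
low, mid, high = split_into_ranges(list(freqs), freqs)
```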
Table 3 shows the overall results of predicting compositionality across our vector space variants and the prediction functions. The best-performing variants per kind of variation (as separated by horizontal lines) are in bold font.

Experiment Results
We can see that the Word2Vec vector space outperforms all other variants with a correlation of ρ = 0.689. Obviously, this is not only a matter of dimensionality, as each reduction variant exhibits an individual behaviour regarding the optimal number of dimensions: The rather similar next-best results are reached with (i) using the most frequent corpus nouns NN-25000/NN-30000 (which effectively relies on 6,000-7,000 noun dimensions): ρ = 0.663; (ii) using only nouns (NN: all 52,285 of them): ρ = 0.658; and (iii) PCA on the noun-only matrix (NN-PCA), when using 2,000 dimensions: ρ = 0.657. Performing PCA on the whole matrix (All-PCA) is worse, reaching a maximum of ρ = 0.616 with 5,000 dimensions. A purely pos-based reduction for verbs-only reaches ρ = 0.581, in comparison to ρ = 0.658 for nouns-only, thus confirming the study by Schulte im Walde et al. (2013) in that nouns are more reliable than verbs in vector spaces for predicting compositionality. The baseline using all context dimensions (ρ = 0.630) is worse than all reduced conditions other than running PCA on the whole matrix. Therefore, next to identifying a clear winner (Word2Vec), we can conclude from our results that using only the most frequent noun dimensions is a reasonable alternative.
Regarding the prediction functions, ADD, MULT and COMB (with only marginal differences between them in most cases) generally outperform WORD1 and WORD2. So combining the relatedness information for compound-modifier and compound-head pairs is better for predicting the overall compounds' degree of compositionality than relying on just one or the other. Note that the predictions using the compound-head information (WORD2) are often strongly below the compound-modifier predictions (WORD1).

Table 2 (excerpt): vector-space variants and their dimensionalities.

All-PCA-100: PCA with 100 dimensions computed on whole matrix (100)
All-PCA-500: PCA with 500 dimensions computed on whole matrix (500)
All-PCA-1000: PCA with 1,000 dimensions computed on whole matrix (1,000)
All-PCA-2000: PCA with 2,000 dimensions computed on whole matrix (2,000)
All-PCA-5000: PCA with 5,000 dimensions computed on whole matrix (5,000)
NN-PCA-100: PCA with 100 dimensions computed on noun matrix (100)
NN-PCA-500: PCA with 500 dimensions computed on noun matrix (500)
NN-PCA-1000: PCA with 1,000 dimensions computed on noun matrix (1,000)
NN-PCA-2000: PCA with 2,000 dimensions computed on noun matrix (2,000)
NN-PCA-5000: PCA with 5,000 dimensions computed on noun matrix (5,000)
Word2Vec: word2vec two-layer neural network representation (300)

In the following we now zoom into the results for specific subsets of the gold standard, distinguishing between low-/mid-/high-frequency compounds, compounds with low-/mid-/high-productivity modifiers vs. heads, and compounds with low-/mid-/high-compositionality phrases, modifiers and heads. In general, we observed that training the regression on the whole dataset and testing it on the subsets yielded the same results as training the regression on the subsets.
The results on the subsets are shown in Tables 4-9. The best-performing variant per range is in bold font; in addition, the best-performing variant per reduction kind is highlighted by yellow background colour.
Results across Compound Frequency Ranges

Zooming into the prediction results for high-, mid- and low-frequency compounds (see Table 4), we first of all observe that Word2Vec by far outperforms the other reduction variants for high- and low-frequency compounds. In addition, the most striking differences in Table 4 in comparison to Table 3 are two-fold: On the one hand, we can see that the prediction results for low-frequency compounds are much below those for mid-frequency and high-frequency compounds; only for Word2Vec this is not the case. On the other hand, the (rather low) best prediction results for the low-frequency compounds are achieved by WORD1 and WORD2 (again, this does not apply to Word2Vec but to all other kinds of reduction). Finally, in all but Word2Vec the prediction results for mid-frequency compounds are clearly above those for low- and high-frequency compounds.

Results distinguishing Modifier Productivity Ranges
Zooming into the prediction results for compounds with high-, mid- and low-productivity modifiers (see Table 5), we can see that, in contrast to the previous cases, here the nouns-only vector space provides the overall best results; this is the case for compounds with low-productivity modifiers. Overall, however, we cannot observe strong differences across reduction variants: several kinds of spaces are similarly successful across compound subsets. Interestingly, though, we observe much more variability in which prediction functions are best at predicting compositionality for compounds with low-, mid- and high-productivity modifiers. Overall, WORD1, ADD, MULT and COMB take turns in being most successful, and there is no subset-function pairing that strikes us as a particularly strong combination. So in sum, it is difficult to identify any tendencies of variants across modifier productivity subsets. This insight is in line with our previous work (Schulte im Walde et al., 2016), which also demonstrated that empirical modifier properties do not have a consistent effect on the quality of predicting compound compositionality.
Results distinguishing Head Productivity Ranges

In contrast, zooming into the prediction results for compounds with high-, mid- and low-productivity heads (see Table 6), we do observe patterns for compound subsets. In all chosen space variants, the prediction is best for compounds with mid-productivity heads, second-best for those with high-productivity heads and worst for those with low-productivity heads. This is surprising on the one hand, given that mid-range ratings typically show higher standard deviations and less agreement across human raters (Pollock, 2018), so one might consider their degrees of compositionality more difficult to distinguish than others. On the other hand, compounds with low-productivity heads are supposedly more influenced by sparse data in the vectors, and this does not seem to change in dimensionality-reduced vector spaces.
Comparing vector variants and prediction functions, Word2Vec is again the best option but the noun-based variants NN and NN-PCA are similarly successful. ADD, MULT and COMB are mostly the best functions, but in individual low-productivity cases WORD1 and WORD2 are best.
Results distinguishing Compositionality Ranges

Finally, Tables 7-9 zoom into prediction results across degrees of compositionality, regarding the compound phrase as a whole (Table 7), the compound-modifier relation (Table 8), and the compound-head relation (Table 9). For predictions across degrees of phrase compositionality (Table 7), Word2Vec is the clear winner for high- and low-compositional compounds, and for mid-compositional compounds both NN-PCA and Word2Vec clearly outperform the other variants. For high-compositional compounds, WORD1 is the best prediction function, so modifiers seem to determine the prediction in high-compositional cases. Otherwise ADD, MULT and COMB represent the best functions, as before.
For compounds with varying modifier or head compositionality the picture is more diverse. What is most interesting here is that for compounds with low-compositional modifiers (Table 8) WORD2 represents the best prediction function, while in most cases in Table 9 WORD1 represents the best prediction function. We interpret this behaviour as follows: For compounds with low-compositional modifiers, the compound-modifier semantic relatedness is low, and here the strength of the compound-head relatedness (which is effectively WORD2) correlates with the degree of compositionality of the phrase. Thus, in cases with low compound-modifier relatedness, the degrees of compositionality of the compound phrase and of the compound-head pair are similar in their ranks across compounds. When investigating compounds with varying degrees of head compositionality, this effect even applies across compound-head compositionality ranges, i.e., the strength of the compound-modifier relatedness (which is effectively WORD1) correlates with the degree of compositionality of the phrase, so the degrees of compositionality of the compound phrase and of the compound-modifier pair are similar in their ranks within all three ranges.

Summary and Conclusion
This study provided a systematic evaluation of vector-space reductions across kinds, i.e., exploring part-of-speech-based reduction, Principal Components Analysis using Singular Value Decomposition, and word2vec embeddings. Relying on the gold standard of English noun compounds by Reddy et al. (2011b), our vector-space variant experiments identified word2vec with 300 dimensions as the clear winner. Similarly good and stable predictions were achieved when using a large subset of context nouns (in our case relying on the ca. 25,000-30,000 most frequent out of a total of ca. 50,000 noun types), with or without any further PCA reduction.
Zooming into prediction functions and compound and constituent properties, we further demonstrated that, while the overall best predictions are obtained with function combinations (addition, multiplication, or a combination of both), the picture varies strongly across subsets representing different ranges of compositionality, frequency and productivity:

1. Predictions for low-frequency compounds are much worse, and predictions for mid-frequency compounds are much better than on average.
2. There are no obvious tendencies across modifier productivity ranges, but for head productivity ranges we observe very high prediction results for mid-productivity, very low prediction results for low-productivity, and medium prediction results for high-productivity subsets.

Table 9: Results distinguishing head compositionality ranges.
3. For compounds with low compound-modifier relatedness the compound-head relatedness can be used for predicting the overall compound phrase compositionality; even stronger, the compound-modifier relatedness can be used for predicting the overall compound phrase compositionality for compounds across compound-head relatedness ranges.
Many of these insights correspond to those in Schulte im Walde et al. (2016) and once more emphasise the importance of balancing target properties in gold standards. Especially the latter results call for further work on other datasets and across languages.