Learning Lexical Subspaces in a Distributional Vector Space

Abstract In this paper, we propose LexSub, a novel approach towards unifying lexical and distributional semantics. We inject knowledge about lexical-semantic relations into distributional word embeddings by defining subspaces of the distributional vector space in which a lexical relation should hold. Our framework can handle symmetric attract and repel relations (e.g., synonymy and antonymy, respectively), as well as asymmetric relations (e.g., hypernymy and meronymy). In a suite of intrinsic benchmarks, we show that our model outperforms previous approaches on relatedness tasks and on hypernymy classification and detection, while being competitive on word similarity tasks. It also outperforms previous systems on extrinsic classification tasks that benefit from exploiting lexical relational cues. We perform a series of analyses to understand the behaviors of our model. Code available at https://github.com/aishikchakraborty/LexSub.


Introduction
Pre-trained word embeddings are the bedrock of modern natural language processing architectures. Their success is attributed to their ability to embody the distributional hypothesis (Harris, 1954; Firth, 1957), which states that ''the words that are used in the same contexts tend to purport similar meanings'' (Harris, 1954).
The biggest strength of the embedding methods, their ability to cluster distributionally related words, is also their biggest weakness. This contextual clustering of words brings together words that might be used in a similar context in the text, but that might not necessarily be semantically similar, or worse, might even be antonyms (Lin et al., 2003).
Several techniques have been proposed in the literature to modify word vectors to incorporate lexical-semantic relations into the embedding space (Yu and Dredze, 2014; Xu et al., 2014; Fried and Duh, 2014; Faruqui et al., 2015; Mrkšić et al., 2016; Glavaš and Vulić, 2018). The common theme of these approaches is that they modify the original distributional vector space using auxiliary lexical constraints to endow the vector space with a sense of lexical relations. However, a potential limitation of this approach is that the alteration of the original distributional space may cause a loss of the distributional information that made these vectors so useful in the first place, leading to degraded performance when the vectors are used in downstream tasks.
This problem could be further exacerbated when multiple relations are incorporated, especially as different lexical-semantic relations have different mathematical properties. For example, synonymy is a symmetric relation, whereas hypernymy and meronymy are asymmetric relations. It would be difficult to control the interacting effects that constraints induced by multiple relations could have on the distributional space.
The solution that we propose is to enforce a separation of concerns, in which distributional information is addressed by a central main vector space, whereas each lexical relation is handled by a separate subspace of the main distributional space. The interface between these components is then a projection operation from the main distributional space into a lexical subspace. Our framework, LEXSUB, thus formulates the problem of enforcing lexical constraints as a problem of learning a linear subspace for each of the lexical relations within the distributional vector space. Figure 1 shows a conceptual diagram of the relationship between the distributional space and the lexical subspaces in LEXSUB.

Figure 1: A concept diagram contrasting other post-hoc approaches with our LEXSUB framework. Our LEXSUB framework enforces the lexical constraints in lexical relation-specific subspaces, whereas the other approaches try to learn lexical relations in the original distributional vector space.
We show that LEXSUB outperforms previous methods in a variety of evaluations, particularly on intrinsic relatedness correlation tasks, and in extrinsic evaluations in downstream settings. We also show that LEXSUB is competitive with existing models on intrinsic similarity evaluation tasks. We run a series of analyses to understand why our method improves performance in these settings.
Our experimental results suggest that explicitly separating lexical relations into their own subspaces allows the model to better capture the structure of each lexical relation without being polluted by information from the distributional space. Conversely, the main distributional vector space is not polluted by the need to model lexical relations in the same space, as is the case for previous models. Furthermore, the explicit linear projection that is learned ensures that a relation-specific subspace exists in the original distributional vector space, and can thus be discovered by a downstream model if the extrinsic task requires knowledge about lexical-semantic relations.
Contributions. In summary, we propose LEXSUB, a framework for learning lexical linear subspaces within the distributional vector space. The proposed framework can model all major kinds of lexical-semantic relations, namely, attract-symmetric, repel-symmetric, and attract-asymmetric. We demonstrate that our approach outperforms or is competitive with previous approaches on intrinsic evaluations, and outperforms them on a suite of downstream extrinsic tasks that might benefit from exploiting lexical relational information. Finally, we design a series of experiments to better understand the behaviors of our model and provide evidence that the separation of concerns achieved by LEXSUB is responsible for its improved performance.

Related Work
Several approaches have been proposed for unifying lexical and distributional semantics. These approaches can broadly be classified into two categories: 1) post-hoc and 2) ad-hoc approaches. Post-hoc approaches fine-tune pre-trained embeddings to fit lexical relations. Ad-hoc models, on the other hand, add auxiliary lexical constraints to the distributional similarity loss. Both post-hoc and ad-hoc approaches rely on lexical databases such as WordNet (Miller, 1995), FrameNet (Baker et al., 1998), BabelNet (Navigli and Ponzetto, 2012), and PPDB (Ganitkevitch et al., 2013; Pavlick et al., 2015) for symbolically encoded lexical relations that are translated into lexical constraints. These lexical constraints endow the embeddings with lexical-semantic relational information.
Post-hoc Approaches. In the post-hoc approach, pre-trained word vectors such as GloVe (Pennington et al., 2014), Word2Vec (Mikolov et al., 2013), FastText (Bojanowski et al., 2017), or Paragram (Wieting et al., 2015) are fine-tuned to endow them with lexical relational information (Faruqui et al., 2015; Rothe and Schütze, 2015; Wieting et al., 2015; Mrkšić et al., 2016; Jo, 2018; Jo and Choi, 2018; Glavaš and Vulić, 2018). In this paper, we primarily discuss LEXSUB as a post-hoc model. This formulation of LEXSUB is similar to the other post-hoc approaches mentioned above, with the significant difference that the lexical relations are enforced in a lexical subspace instead of the original distributional vector space. Rothe et al. (2016) explore the idea of learning specialized subspaces that reduce the dimensionality of the distributional space such that it maximally preserves relevant task-specific information at the expense of distributional information. Unlike Rothe et al. (2016), our proposed method tries to retain the distributional information in the embeddings so that they can be used as a general-purpose initialization in any NLP pipeline. Embeddings from Rothe et al. (2016)'s method can only be used for the task on which they were trained.
Ad-hoc Approaches. The ad-hoc class of approaches adds auxiliary lexical constraints to the distributional similarity loss function, usually a language modeling objective such as CBOW (Mikolov et al., 2013) or a recurrent neural network language model (Mikolov et al., 2010; Sundermeyer et al., 2012). These constraints can be viewed either as a prior or as a regularizer on the distributional objective (Yu and Dredze, 2014; Xu et al., 2014; Kiela et al., 2015a; Fried and Duh, 2014). In other work, the original language modeling objective is modified to incorporate lexical constraints (Liu et al., 2015; Osborne et al., 2016; Bollegala et al., 2016; Ono et al., 2015; Nguyen et al., 2016, 2017; Tifrea et al., 2018). We discuss the ad-hoc formulation of LEXSUB in Appendix A.
An alternate axis along which to classify these approaches is by their ability to model different types of lexical relations. These types can be enumerated as symmetric-attract (synonymy), symmetric-repel (antonymy), and asymmetric-attract (hypernymy, meronymy). Most approaches mentioned above can handle symmetric-attract type relations, but only a few of them can model other types of lexical relations. For example, Ono et al. (2015) can exclusively model antonymy, Tifrea et al. (2018) and Nguyen et al. (2017) can only model hypernymy, Mrkšić et al. (2016) can model synonymy and antonymy, and LEAR can handle synonymy, antonymy, and hypernymy relations. Our proposed framework can model all types of lexical relations, namely, symmetric-attract, symmetric-repel, and asymmetric-attract, and uses all four major lexical relations found in lexical resources like WordNet, namely, synonymy, antonymy, hypernymy, and meronymy; it could also flexibly include more relations. To our knowledge, we are the first to use meronymy lexical relations.
Other Approaches. Several approaches do not fall into either of the categories mentioned above. One subset of these approaches attempts to learn lexical relations, especially hypernymy, directly by embedding a lexical database, for example, Poincaré Embeddings (Nickel and Kiela, 2017) or Order-Embeddings (Vendrov et al., 2015). Another set of approaches, like DIH (Chang et al., 2018) or Word2Gauss (Vilnis and McCallum, 2014; Athiwaratkun and Wilson, 2017), attempts to learn the hypernymy relation directly from the corpus without relying on any lexical database. A third set of approaches attempts to learn a scoring function over sparse bag-of-words (SBOW) features. These approaches are summarized by Shwartz et al. (2017).

Task Definition
Given a vocabulary $V = \{x_1, x_2, \ldots, x_n\}$, our objective is to create a set of vectors $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\} \subset \mathbb{R}^d$ that respect both distributional similarity and lexical-semantic relations. We refer to these vectors as the main vector space embeddings. Let $R$ be the relation set corresponding to a lexical-semantic relation $r$. The elements of this relation set are ordered pairs of words $(x_i, x_j) \in V \times V$; that is, if $(x_i, x_j) \in R$, then $x_i$ and $x_j$ are related by the lexical relation $r$. For symmetric relations like synonymy and antonymy, $(x_i, x_j) \in R$ implies $(x_j, x_i) \in R$. For asymmetric relations like hypernymy and meronymy, $x_j$ is related to $x_i$ by relation $r$ if $(x_i, x_j) \in R$ and $(x_j, x_i) \notin R$.

Our model has two components. The first component helps the model learn the lexical subspaces within the distributional vector space. These subspaces are learned using a loss function $\mathcal{L}_{lex}$ defined in Section 3.2.4. The second component helps the model learn the distributional vector space. The training of this vector space is aided by a loss function $\mathcal{L}_{dist}$ defined in Section 3.3. The total loss that we optimize is therefore $\mathcal{L}_{total} = \mathcal{L}_{dist} + \mathcal{L}_{lex}$.
Distance Function. In the subsequent subsections, we build lexical subspace distance functions using the cosine distance $d(\mathbf{x}, \mathbf{y}) = 1 - \mathbf{x} \cdot \mathbf{y} / (\|\mathbf{x}\| \, \|\mathbf{y}\|)$, where $\mathbf{x}$ and $\mathbf{y}$ are the embeddings of the words $x$ and $y$, respectively.

Learning Lexical Subspaces in the Distributional Space
In this section, we discuss three types of abstract lexical losses-attract symmetric, attract asymmetric, and repel symmetric-that are commonly found in lexical databases like WordNet. We then discuss a negative sampling loss that prevents the model from finding trivial solutions to the lexical objective.

Abstract Lexical Relation Loss
Let $x_i$ and $x_j$ be a pair of words related by a lexical relation $r$. We project their embeddings $\mathbf{x}_i, \mathbf{x}_j \in \mathbb{R}^d$ to an $h$-dimensional lexical subspace ($h < d$) using a learned relation-specific projection matrix $W_r^{proj}$ with dimensions $h \times d$. The distance between any two words $x_i$ and $x_j$ in the lexical subspace is defined as the distance between their projected embeddings. We define this relation-specific subspace distance function as:

$$d_r^{proj}(x_i, x_j) = d(W_r^{proj}\mathbf{x}_i,\; W_r^{proj}\mathbf{x}_j) \quad (1)$$

The lexical subspaces can be categorized into three types: attract symmetric, attract asymmetric, and repel symmetric. In an attract symmetric subspace, the objective is to minimize the distance between the lexically related word pair $x_i$ and $x_j$. The corresponding loss function is:

$$\mathcal{L}_r^{att\text{-}sym} = \sum_{(x_i, x_j) \in R} d_r^{proj}(x_i, x_j) \quad (2)$$

Similarly, for repel symmetric lexical relations such as antonymy, the goal is to maximize the distance (up to a margin $\gamma$) between the two projected embeddings. We define a repel loss for $r$, $\mathcal{L}_r^{rep}$, as:

$$\mathcal{L}_r^{rep} = \sum_{(x_i, x_j) \in R} \max\bigl(0,\; \gamma - d_r^{proj}(x_i, x_j)\bigr) \quad (3)$$

In the case of attract asymmetric relations, we encode the asymmetry of the relationship between $x_i$ and $x_j$ by defining an asymmetric distance function $d_r^{asym}$ in terms of an affine transformation of the embedding of $x_j$:

$$d_r^{asym}(x_i, x_j) = d(W_r^{proj}\mathbf{x}_i,\; W_r^{asym}\mathbf{x}_j + \mathbf{b}_r^{asym}) \quad (4)$$

where $W_r^{asym}$ (an $h \times d$ matrix) and $\mathbf{b}_r^{asym}$ (an $h$-dimensional vector) are the parameters of the affine function.
The attract asymmetric loss function is then defined in terms of $d_r^{proj}$ and $d_r^{asym}$ as:

$$\mathcal{L}_r^{att\text{-}asym} = \sum_{(x_i, x_j) \in R} \Bigl[ d_r^{proj}(x_i, x_j) + \max\bigl(0,\; \gamma - d_r^{asym}(x_i, x_j)\bigr) \Bigr] \quad (5)$$

The first term of $\mathcal{L}_r^{att\text{-}asym}$ brings $x_i$'s projected embedding closer to the projected embedding of $x_j$. The second term avoids the trivial solution of the parameterized affine function collapsing to the identity; this is achieved by maximizing the distance between $x_i$ and the affine projection of $x_j$.
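The following PyTorch sketch illustrates one way the subspace distances and the three abstract losses (Eqs. 1-5, as reconstructed above) could be implemented. The module layout, tensor shapes, and margin handling are our own assumptions for illustration, not the authors' reference code.

```python
# Minimal sketch of a relation-specific lexical subspace and its losses.
import torch
import torch.nn.functional as F

def cosine_distance(a, b):
    """d(a, b) = 1 - cos(a, b), computed row-wise over the last dimension."""
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

class LexicalSubspace(torch.nn.Module):
    def __init__(self, d=300, h=100):
        super().__init__()
        self.W_proj = torch.nn.Linear(d, h, bias=False)   # W_r^proj (h x d)
        self.affine = torch.nn.Linear(d, h, bias=True)    # W_r^asym, b_r^asym

    def d_proj(self, xi, xj):
        # distance between the two words' projections into the subspace (Eq. 1)
        return cosine_distance(self.W_proj(xi), self.W_proj(xj))

    def d_asym(self, xi, xj):
        # distance between xi's projection and the affine transform of xj (Eq. 4)
        return cosine_distance(self.W_proj(xi), self.affine(xj))

    def attract_symmetric_loss(self, xi, xj):
        # pull related pairs together in the subspace (Eq. 2)
        return self.d_proj(xi, xj).sum()

    def repel_symmetric_loss(self, xi, xj, gamma=1.0):
        # push related pairs apart, up to margin gamma (Eq. 3)
        return torch.clamp(gamma - self.d_proj(xi, xj), min=0.0).sum()

    def attract_asymmetric_loss(self, xi, xj, gamma=1.0):
        # pull the pair together while keeping xi away from the affine image of xj,
        # so the affine map does not collapse onto the projection (Eq. 5)
        return (self.d_proj(xi, xj)
                + torch.clamp(gamma - self.d_asym(xi, xj), min=0.0)).sum()
```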

Negative Sampling
We supplement our lexical loss functions with a negative sampling loss. This helps avoid trivial solutions, such as all word embeddings collapsing to a single point in an attract subspace, or all words being maximally distant in a repel subspace.
We generate negative samples by uniformly sampling $n$ words from the vocabulary $V$. For attract subspaces (both attract symmetric and attract asymmetric), we ensure that negatively sampled words are at a minimum distance $\delta_r^{min}$ from $x_i$ in the subspace. Similarly, for repel subspaces, we ensure that negative samples are at a distance of at most $\delta_r^{max}$ from $x_i$. The attract and repel negative sampling losses are:

$$\mathcal{L}_r^{att\text{-}neg} = \sum_{(x_i, x_j) \in R} \sum_{l=1}^{n} \max\bigl(0,\; \delta_r^{min} - d_r^{proj}(x_i, x_l)\bigr)$$

$$\mathcal{L}_r^{rep\text{-}neg} = \sum_{(x_i, x_j) \in R} \sum_{l=1}^{n} \max\bigl(0,\; d_r^{proj}(x_i, x_l) - \delta_r^{max}\bigr)$$

where $x_l$ indicates a negative sample drawn from a uniform distribution over the vocabulary.
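A matching sketch of the negative-sampling terms, reusing the hypothetical LexicalSubspace module from the previous sketch; `delta_min` and `delta_max` stand in for the per-relation thresholds, and the flattened batch shapes are an assumption.

```python
import torch

def sample_negative_embeddings(embedding, num_negatives, batch_size):
    # uniformly sample negative words from the vocabulary (flattened batch)
    idx = torch.randint(0, embedding.num_embeddings, (batch_size * num_negatives,))
    return embedding(idx)                      # shape: (batch_size * num_negatives, d)

def attract_negative_loss(subspace, xi, x_neg, delta_min=1.0):
    # hinge: penalize negatives that come closer than delta_min to xi in the subspace;
    # xi is assumed to be repeated so its rows align with x_neg
    return torch.clamp(delta_min - subspace.d_proj(xi, x_neg), min=0.0).sum()

def repel_negative_loss(subspace, xi, x_neg, delta_max=0.5):
    # hinge: penalize negatives that drift farther than delta_max from xi
    return torch.clamp(subspace.d_proj(xi, x_neg) - delta_max, min=0.0).sum()
```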

Relation-Specific Losses
Synonymy Relations. As synonymy is an attract symmetric relation, we use $\mathcal{L}_{syn}^{att\text{-}sym}$ as our lexical loss and $\mathcal{L}_{syn}^{att\text{-}neg}$ as our negative sampling loss, with the negative sampling loss weighted by a negative sampling ratio hyperparameter $\mu$:

$$\mathcal{L}_{syn} = \mathcal{L}_{syn}^{att\text{-}sym} + \mu \mathcal{L}_{syn}^{att\text{-}neg} \quad (6)$$

Antonymy Relations. The antonymy relation is the mirror image of the synonymy relation; hence, we use the same subspace for both relations (i.e., $W_{ant}^{proj} = W_{syn}^{proj}$). As antonymy is a repel lexical relation, we use $\mathcal{L}_{syn}^{rep}$ as our lexical loss and $\mathcal{L}_{syn}^{rep\text{-}neg}$ as our negative sampling loss:

$$\mathcal{L}_{ant} = \mathcal{L}_{syn}^{rep} + \mu \mathcal{L}_{syn}^{rep\text{-}neg} \quad (7)$$
Hypernymy Relations. Hypernymy is an attract asymmetric relation; hence, we use $\mathcal{L}_{hyp}^{att\text{-}asym}$ as the lexical loss and $\mathcal{L}_{hyp}^{att\text{-}neg}$ as the negative sampling loss:

$$\mathcal{L}_{hyp} = \mathcal{L}_{hyp}^{att\text{-}asym} + \mu \mathcal{L}_{hyp}^{att\text{-}neg} \quad (8)$$
Meronymy Relations. Meronymy is also an attract asymmetric relation. Therefore, in a similar manner, the lexical loss is $\mathcal{L}_{mer}^{att\text{-}asym}$ and the negative sampling loss is $\mathcal{L}_{mer}^{att\text{-}neg}$:

$$\mathcal{L}_{mer} = \mathcal{L}_{mer}^{att\text{-}asym} + \mu \mathcal{L}_{mer}^{att\text{-}neg} \quad (9)$$

Total Lexical Subspace Loss
Based on the individual lexical losses defined above, the total lexical subspace loss is defined as:

$$\mathcal{L}_{lex} = \nu_{syn}\mathcal{L}_{syn} + \nu_{ant}\mathcal{L}_{ant} + \nu_{hyp}\mathcal{L}_{hyp} + \nu_{mer}\mathcal{L}_{mer} \quad (10)$$

where $\nu_{syn}, \nu_{ant}, \nu_{hyp}, \nu_{mer} \in [0, 1]$ are lexical relation ratio hyperparameters weighing the importance of each lexical relation.
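To make the bookkeeping concrete, a small helper along the following lines could combine the per-relation terms of Eqs. 6-10; the dictionary layout is an assumption for illustration.

```python
def lexical_loss(relation_losses, nu, mu):
    """relation_losses[r] = (lexical_term, negative_term) for r in syn/ant/hyp/mer;
    nu[r] is the relation ratio hyperparameter and mu the negative-sampling ratio."""
    total = 0.0
    for r in ("syn", "ant", "hyp", "mer"):
        lex_term, neg_term = relation_losses[r]
        total = total + nu[r] * (lex_term + mu * neg_term)   # Eqs. 6-9 inside, Eq. 10 outside
    return total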

Preserving the Distributional Space
In the post-hoc setting, we start from pre-trained word embeddings. The $\mathcal{L}_{dist}$ component aims to minimize the change in the word embeddings, measured by L2 distance, in order to preserve the distributional information in the pre-trained embeddings:

$$\mathcal{L}_{dist} = \sum_{x_i \in V} \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|_2^2$$

where $\hat{\mathbf{x}}_i$ denotes the pre-trained embedding of $x_i$.
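Under the reconstruction above (per-word drift from the pre-trained vectors), the preservation term could be sketched as follows; whether the original uses squared or plain L2 distance is an assumption.

```python
def distributional_loss(emb, emb_pretrained):
    # emb: trainable embedding matrix (torch tensor);
    # emb_pretrained: frozen copy of the original GloVe vectors, rows aligned by word
    return ((emb - emb_pretrained) ** 2).sum(dim=-1).sum()
```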

Overall Loss Function
The overall loss of LEXSUB is $\mathcal{L}_{total} = \mathcal{L}_{dist} + \mathcal{L}_{lex}$.

Training Setup
In this section, we describe the datasets and models that we use in our experiments. The output of our model is the main vector space embedding that is endowed with the specialized lexical subspaces. All our evaluations are done on the main vector space embeddings unless stated otherwise.

Training Dataset
Our experiments were conducted using 300-dimensional GloVe embeddings (Pennington et al., 2014) trained on 6 billion tokens from the Wikipedia 2014 and Gigaword 5 corpora. The vocabulary size for the GloVe embeddings is 400,000.

Lexical Resource
We use WordNet (Miller, 1995) as the lexical database for all experiments. We consider all four types of lexical relations: synonymy, antonymy, hypernymy, and meronymy. Only those relation triples where both words occur in the vocabulary are considered. We consider both instance and concept hypernyms for hypernymy relations, and for meronymy relations, part, substance, and member meronyms were included as constraints. Table 1 shows the relation-wise split used in the experiments.
Hyperparameters

We rely on the validation sets corresponding to our extrinsic tasks (Section 6.2) for choosing the values of the hyperparameters introduced above (e.g., $\mu$, the $\nu_r$ ratios, the margin $\gamma$, and the $\delta_r$ thresholds). We ran a grid search over the hyperparameter space and selected the final set of hyperparameters by first ranking the validation results for each task in descending order and then calculating the mean rank across the tasks. We selected the hyperparameters that achieved the best (i.e., lowest) mean rank.
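The mean-rank selection can be sketched as follows; the `results[config][task]` structure and the higher-is-better score convention are assumptions for illustration.

```python
import numpy as np

def select_by_mean_rank(results):
    """Pick the configuration with the lowest mean rank across validation tasks."""
    configs = list(results.keys())
    tasks = list(next(iter(results.values())).keys())
    ranks = np.zeros((len(configs), len(tasks)))
    for t, task in enumerate(tasks):
        scores = np.array([results[c][task] for c in configs])
        order = np.argsort(-scores)              # higher score = better, so rank descending
        ranks[order, t] = np.arange(1, len(configs) + 1)
    mean_rank = ranks.mean(axis=1)
    return configs[int(np.argmin(mean_rank))]
```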

Baselines
Vanilla. The Vanilla baselines refer to the original GloVe word embeddings without any lexical constraints.
Retrofitting. Retrofitting (Faruqui et al., 2015) uses similarity constraints from lexical resources to pull similar words together. The objective function that retrofitting optimizes consists of a reconstruction loss (analogous to our $\mathcal{L}_{dist}$) and a symmetric-attract loss.

Counterfitting. Counterfitting (Mrkšić et al., 2016) builds on retrofitting but also supports repel symmetric relations. Its objective function consists of three parts: Synonym Attract, Antonym Repel, and a Vector Space Preservation loss, similar to our $\mathcal{L}_{syn}^{att\text{-}sym}$, $\mathcal{L}_{syn}^{rep}$, and $\mathcal{L}_{dist}$, respectively.
LEAR. LEAR (Vulić and Mrkšić, 2018) expands the counterfitting framework by adding a Lexical Entailment (LE) loss. This LE loss encodes a hierarchical ordering between concepts (hyponym-hypernym relationships) and can handle attract asymmetric relations.
We train each of the baseline models using the lexical resources described in Section 4.2. LEAR, LEXSUB, and Counterfitting were trained on all four lexical relations, whereas Retrofitting was trained only on attract relations, namely, synonymy, hypernymy, and meronymy. This is due to Retrofitting's inability to handle repel-type relations. We also report the results of our experiments with LEXSUB and the baselines trained on the lexical resource from LEAR in Appendix B.

Intrinsic Tasks
Word Similarity Task. We use four popular word similarity test sets. We use the men3k dataset (Bruni et al., 2014) and the relatedness section of the WordSim353 dataset (Agirre et al., 2009) to measure the embeddings' ability to retain distributional information. We use the SimLex-999 dataset (Hill et al., 2015) and SimVerb-3500 (Gerz et al., 2016) to evaluate the embeddings' ability to detect graded synonymy and antonymy relations. Both the relatedness and similarity tasks were evaluated in the main vector space for LEXSUB.
Hypernymy Tasks. Following Roller et al. (2018), we consider three tasks involving hypernymy: graded hypernymy evaluation, hypernymy classification, and directionality detection. We use the hypernymy subspace embeddings for LEXSUB for these experiments.
For graded hypernymy evaluation, we use the Hyperlex dataset (Vulić et al., 2017) and report results on the complete Hyperlex dataset. We measure Spearman's ρ between the cosine similarity of the embeddings of the word pairs and the human ratings.
The hypernymy classification task is an unsupervised task: classify whether a pair of words are hypernym/hyponym of each other. We consider four of the five benchmark datasets considered in Roller et al. (2018); namely, BLESS (Baroni and Lenci, 2011), LEDS (Baroni et al., 2012), EVAL (Santus et al., 2014), and WBLESS (Weeds et al., 2014). We do not consider the SHWARTZ dataset (Shwartz et al., 2016), as the proportion of out-of-vocabulary words was high (38% for LEXSUB, Retrofitting, and LEAR, and 60% for Counterfitting with GloVe). The evaluation is done by ranking the word pairs by cosine similarity and computing the mean average precision over the ranked list.
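A sketch of this ranked-list evaluation: score each candidate pair by cosine similarity of its (hypernymy-subspace) embeddings, rank all pairs, and compute average precision against the gold labels. Dataset loading and the exact averaging convention are assumptions.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def ranked_average_precision(pairs, labels, emb):
    """pairs: list of (w1, w2); labels: 1 if the pair is a true hypernym pair; emb: word -> vector."""
    scores = np.array([cosine(emb[w1], emb[w2]) for w1, w2 in pairs])
    order = np.argsort(-scores)                  # most similar pairs first
    gold = np.asarray(labels, dtype=float)[order]
    precision_at_k = np.cumsum(gold) / (np.arange(len(gold)) + 1)
    return float((precision_at_k * gold).sum() / max(gold.sum(), 1))
```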
The hypernymy directionality detection task is designed to detect which of the two terms is the hypernym of the other; that is, given two words w1 and w2, is w1 the hypernym of w2 or vice versa. We consider two of the three datasets from Roller et al. (2018); namely, WBLESS and BIBLESS (Kiela et al., 2015b). The classification setup is similar to Roller et al. (2018) and is done using the open-source package provided by the authors.

Extrinsic Tasks
We evaluate our embeddings on five extrinsic tasks that could benefit from lexical relational cues. We do so by injecting our embeddings into recent high-performing models for those tasks. The tasks and models are as follows.

NER Classification. We use the CoNLL 2003 NER task (Tjong Kim Sang and De Meulder, 2003) for the Named Entity Recognition (NER) task. The dataset consists of news stories from Reuters in which the entities are labeled with four classes (PER, LOC, ORG, MISC). For NER, we use the reference model provided by the AllenNLP toolkit (see Method below).
Sentiment Classification. We use the Biattentive Classification Network (BCN) of McCann et al. (2017) to train a sentiment classifier. We train all models for sentiment classification on the Stanford Sentiment Treebank (SST) (Socher et al., 2013). We use the two-class granularity: following McCann et al. (2017), we remove the ''neutral'' class and use only the ''positive'' and ''negative'' classes for classification.
Textual Entailment. For textual entailment experiments, we use the Decomposable Attention model by Parikh et al. (2016) for our experiments. We train and evaluate the models on the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015) using the standard train, test and validation split.
Question Answering. We use the SQUAD1.1 question answering dataset (Rajpurkar et al., 2016). The dataset contains 100k+ crowd-sourced question answer pairs. We use the BiDAF model (Seo et al., 2016) for the question answering task. We report the accuracy on the development set for SQuAD.
Paraphrase Detection. For the paraphrase detection task, we use the BIMPM model by Wang et al. (2017) for our experiments. We train and evaluate the models on the Quora Question Pairs (QQP) dataset 3 using the standard splits.
Method. For the above models, we use the reference implementations provided by the AllenNLP toolkit (Gardner et al., 2018). We replace the input layer of these models with the embeddings we want to evaluate. We use two different setups for our extrinsic experiments and report results for both.
Setup 1: In our first setup, we standardize several representational and training decisions to remove potential confounding effects. This ensures that performance differences in the extrinsic tasks are reflective of the quality of the embeddings under evaluation. We achieve this by making the following changes to all extrinsic task models. First, for the Vanilla models, we use pretrained GloVe embeddings of 300 dimensions, trained on 6 billion tokens. Similarly, we train all post-hoc embeddings using the 6 billion token 300-dimensional pretrained GloVe embeddings and plug these post-hoc embeddings into the extrinsic task model. Second, we remove character embeddings from the input layer. Finally, we do not fine-tune the pretrained embeddings.
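A minimal sketch of the Setup 1 input layer, assuming the evaluated vectors are available as an array whose rows are aligned with the task model's vocabulary; this is an illustration of the "replace and freeze" step, not the AllenNLP configuration itself.

```python
import torch
import torch.nn as nn

def make_input_layer(vectors):
    # vectors: (|V|, 300) array of the embeddings under evaluation;
    # freeze=True disables fine-tuning, matching Setup 1
    return nn.Embedding.from_pretrained(
        torch.as_tensor(vectors, dtype=torch.float), freeze=True)
```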
Setup 2: In order to demonstrate that we are not unfairly penalizing the base models, we also conduct a second set of experiments in which the models for all extrinsic tasks are trained with their original settings (i.e., without the changes mentioned above). In these experiments, we do not remove character embeddings from any model, nor do we place any restrictions on fine-tuning of the pretrained word embeddings. Results for both setups are reported in Table 4.

Results
We now report on the results of our comparisons of LEXSUB to Vanilla embeddings and baselines trained on the same lexical resource as LEXSUB. We use the main vector space embeddings in all our experiments except for hypernymy experiments, for which we use the hypernymy space embeddings.
Intrinsic Evaluations. Table 2 shows that our model outperforms the Vanilla baseline on both relatedness and similarity tasks, outperforms all the other baselines on relatedness, and is competitive with the other baselines on all the word similarity tasks. Table 3 demonstrates that we considerably outperform Vanilla as well as other baseline post-hoc methods on hypernymy tasks. Thus, our subspace-based approach can learn lexical-semantic relations and can perform as well or better than the approaches that enforce lexical constraints directly on the distributional space.
Another important result from Table 2 is the poor performance of Counterfitting and LEAR on relatedness tasks like men3k and WS-353R. We hypothesize that enforcing symmetric-repel (Counterfitting) and asymmetric-attract (Counterfitting and LEAR) constraints directly on the distributional space leads to distortion of the distributional vector space, resulting in poor performance on relatedness tasks. LEXSUB performs competitively on similarity tasks without sacrificing its performance on relatedness tasks, unlike contemporary methods that sacrifice relatedness by optimizing for similarity.
Extrinsic Evaluations. Table 4 presents the results of the extrinsic evaluations. Rows 3-7 present the results for the first setup, that is, experiments without confounds (Setup 1) such as character embeddings and further fine-tuning of the input embeddings. The results for the models trained with the original settings (Setup 2) are presented in rows 9-14. In the original settings, the models for QQP, SQuAD, and NER contain additional trainable character embeddings in the input layer; the original NER model further fine-tunes the input embeddings. In our first set of experiments, we find that the LEXSUB model outperforms the baseline methods on every extrinsic task, and Vanilla on every extrinsic task except SNLI. In our second set of experiments, LEXSUB outperforms previous post-hoc methods on all extrinsic tasks but does worse than GloVe on NER. We hypothesize that the relatively poor performance of LEXSUB with respect to GloVe on NER might be due to the task-specific fine-tuning of the embeddings.
In fact, we find that the baseline approaches, with a few exceptions, do worse than Vanilla across the whole suite of extrinsic tasks in both the settings. Taken together, this indicates that our subspace-based approach is superior if the objective is to use these modified embeddings in downstream tasks.
We hypothesize that these results are indicative of the fact that the preservation of distributional information is crucial to the downstream performance of the embeddings. The baseline approaches, which learn the lexical-semantic relations in the original distributional space, disrupt the distributional information, leading to poor extrinsic task performance. We expand on this point in Section 8.3.

Table 4 (columns: NER F1, SST-2 Acc, SNLI Acc, SQuAD EM, QQP Acc): In both setups, we see that LEXSUB outperforms the baselines on most of the extrinsic tasks. We hypothesize that the relatively poor performance of LEXSUB compared to Vanilla on NER might be due to the task-specific fine-tuning of the embeddings.

State-of-the-Art Results in Extrinsic Tasks.
We have also added the current state-of-the-art results for the respective extrinsic tasks in Table 4 (last row). The current state of the art for NER is Baevski et al. (2019). These state-of-the-art systems are all initialized with contextualized embeddings from a bidirectional transformer-based model trained on data that is orders of magnitude larger than the GloVe variant used in our experiments. The contextualized embeddings, because of their ability to represent a word in the context of its usage, are considerably more powerful than GloVe; hence, the models relying on them are not directly comparable to our model or the other baselines.

Analysis
In this section, we perform several analyses to better understand the behaviors of our model and the baselines, focusing on the following questions. Q1: How well do LEXSUB's lexical subspaces capture the specific lexical relations for which they were optimized, as opposed to the other relations? Q2: Can the lexical subspaces and the manifolds in the main distributional space be exploited by a downstream neural network model? Q3: How well do the models preserve relatedness in the main distributional space?

Lexical Subspace Neighborhoods (Q1)

Table 5 shows the nearest neighbors of example query words in the main space and in the learned lexical subspaces, computed with the cosine distance $d$, the symmetric subspace distance $d_r^{proj}$, and the asymmetric subspace distance $d_r^{asym}$, respectively. We see that most of the closest neighbors in the learned subspace are words that are in the specified lexical relation with the query words.

To systematically quantify these results, we compute the mean average precision (MAP) over the top 100 neighbors for a list of query words. We use the words from the Hyperlex (Vulić et al., 2017) and SimLex (Hill et al., 2015) datasets as the query words for this experiment. For each query word and for each lexical relation, we obtain a list of words from WordNet that are related to the query word through that particular lexical relation. These words form the gold-standard labels for computing the average precision for the query word. Table 6 shows the MAP scores over the top 100 neighbors for the baselines, for LEXSUB, and for its lexical subspaces. The main vector space does worse than all the baselines, which is expected because the baselines learn to fit their lexical relations in the original distributional space. However, if we look at the individual lexical subspaces, we can see that the synonymy, hypernymy, and meronymy subspaces have the best MAP score for their respective relation, demonstrating the separation of concerns property that motivated our approach.
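The neighborhood MAP analysis can be sketched as follows: for each query word, rank the vocabulary by cosine similarity in the given (sub)space, take the top 100 neighbors, and compute average precision against the WordNet gold set for that relation. Vector normalization and gold-set construction are assumptions for illustration.

```python
import numpy as np

def neighborhood_map(queries, gold_sets, vectors, vocab, k=100):
    """queries: query words; gold_sets[q]: set of WordNet-related words for q;
    vectors: (|V|, dim) matrix aligned with vocab."""
    mat = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
    word2idx = {w: i for i, w in enumerate(vocab)}
    aps = []
    for q in queries:
        sims = mat @ mat[word2idx[q]]
        top = np.argsort(-sims)[1:k + 1]         # skip the query word itself
        gold = np.array([vocab[i] in gold_sets[q] for i in top], dtype=float)
        if gold.sum() == 0:
            aps.append(0.0)
            continue
        prec = np.cumsum(gold) / (np.arange(k) + 1)
        aps.append(float((prec * gold).sum() / gold.sum()))
    return float(np.mean(aps))
```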

Lexical Relation Prediction Task (Q2)
One of the motivations behind enforcing explicit lexical constraints on the distributional space is to learn lexico-relational manifolds within the distributional vector space. On any such lexico-relational manifold, the respective lexical relation will hold. For example, on a synonymy manifold, all the synonyms of a word would be clustered together and the antonyms would be maximally distant. Deep learning models would then be able to exploit these lexico-relational manifolds to improve generalization on downstream tasks. To evaluate this hypothesis, we propose a simplified classification setup of predicting the lexical relation between a given word pair. If a downstream model is able to detect these manifolds, it should be able to generalize beyond the word pairs seen in the training set.
Lexical Relation Prediction Dataset. The lexical relation prediction dataset is composed of word pairs as input and their lexical relation as the target. The problem is posed as a four-way classification problem between the relations synonymy, antonymy, hypernymy, and meronymy. The dataset is collected from WordNet and has a total of 606,160 word pairs and labels, split in an 80/20 ratio into training and validation sets. The training set contains 192,045 synonyms, 9,733 antonyms, 257,844 hypernyms, and 25,308 meronyms. Similarly, the validation set split by relation is 96,022 synonyms, 4,866 antonyms, 128,920 hypernyms, and 12,652 meronyms.
We use the word pairs with lexical relation labels from Hyperlex as our test set. We only consider the synonymy, antonymy, hypernymy, and meronymy pairs from this dataset.

Lexical Relation Prediction Model. We use a Siamese network for the relation classification task. The input to the model is a one-hot encoded word pair, which is fed into the embedding layer. This embedding layer is initialized with the embedding that is to be evaluated and is not fine-tuned during training. This is followed by a 1,500-dimensional affine hidden layer with a ReLU activation function that is shared by both word embeddings. This shared non-linear layer is expected to learn a mapping from the distributional vector space to lexico-relational manifolds within the distributional vector space. The shared layer is followed by two different sets of 125 × 4 affine layers, one for each word. These linear layers are put in place to capture the various idiosyncrasies of lexical relations, such as asymmetry and attract versus repel behavior. Finally, the cosine similarity of the hidden representations corresponding to the two words is fed into the softmax layer to map the output to probabilities. The models are trained for 30 epochs using the Adagrad (Duchi et al., 2011) optimizer with an initial learning rate of 0.01 and a gradient clipping ratio of 5.0.

Table 7 shows how well a downstream model can exploit each embedding's lexical relation manifold to classify word pairs by their lexical relation. The LEXSUB model again outperforms all the baseline models in this task. We hypothesize that this is because LEXSUB learns the lexical relations in a linear subspace, which happens to be the simplest possible manifold. Hence, it might be easier for downstream models to exploit it for better generalization.
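A rough PyTorch rendering of the Siamese classifier described above; the layer sizes follow the text (1,500-dim shared ReLU layer, 125 × 4 word-specific layers), but the exact wiring of the final cosine and softmax is an assumption on our part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseRelationClassifier(nn.Module):
    def __init__(self, pretrained, d=300):
        super().__init__()
        # embedding layer initialized with the vectors under evaluation, not fine-tuned
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.shared = nn.Linear(d, 1500)          # shared manifold-mapping layer
        self.head1 = nn.Linear(1500, 125 * 4)     # word-specific layers capturing
        self.head2 = nn.Linear(1500, 125 * 4)     # asymmetry / attract-repel cues

    def forward(self, w1, w2):
        h1 = F.relu(self.shared(self.emb(w1)))
        h2 = F.relu(self.shared(self.emb(w2)))
        z1 = self.head1(h1).view(-1, 4, 125)      # one 125-dim vector per relation
        z2 = self.head2(h2).view(-1, 4, 125)
        logits = F.cosine_similarity(z1, z2, dim=-1)   # (batch, 4) relation scores
        return F.log_softmax(logits, dim=-1)
```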

Preserving the Distributional Space (Q3)
As previously discussed, one of the main motivations of LEXSUB is to separate the learning of lexical relations into subspaces, so that the main distributional vector space is not deformed to as great a degree. We directly measure this deformation by computing the mean shift in the learned embedding space. We define the mean shift as the average L2-distance between the learned and the Vanilla embeddings. We find that the mean shift for LEXSUB is about 30 times lower than the baselines (Table 8). This shows that LEXSUB better preserves the original distributional space, which may explain its better performance in intrinsic relatedness evaluations and extrinsic evaluations.
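A tiny sketch of the mean-shift measure, assuming the learned and Vanilla embedding matrices have rows aligned by word:

```python
import numpy as np

def mean_shift(learned, vanilla):
    # learned, vanilla: (|V|, d) arrays; average L2 distance between counterparts
    return float(np.linalg.norm(learned - vanilla, axis=1).mean())
```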

Conclusion
We presented LEXSUB, a novel framework for learning lexical subspaces in a distributional vector space. The proposed approach properly separates various lexical relations from the main distributional space, which leads to improved downstream task performance, interpretable learned subspaces, and preservation of distributional information in the distributional space.
In future work, we plan to extend our framework to contextualized embeddings and expand it to support hyperbolic distances, which can better model hierarchical relations like hypernymy.
Appendix A: The Ad-hoc LEXSUB Model

The Ad-hoc Distributional Space. Given a sequence of tokens in a corpus $C = (w_1, w_2, \ldots, w_t)$, we minimize the negative log-likelihood:

$$\mathcal{L}_{dist}^{adhoc} = -\sum_{t} \log P(w_t \mid w_{t-k}, \ldots, w_{t-1}; \theta)$$

where $k$ is the size of the sequence under consideration, and the conditional probability $P$ is modeled using a neural language model with parameters $\theta$, which include the embedding matrix $X' = [\mathbf{x}'_1, \cdots, \mathbf{x}'_n]^T$.

Ad-hoc LEXSUB Loss. The total loss in the case of ad-hoc LEXSUB is thus $\mathcal{L}_{total} = \mathcal{L}_{dist}^{adhoc} + \mathcal{L}_{lex}$, where $\mathcal{L}_{lex}$ is defined by Equation 10.
Training Dataset. The ad-hoc model is trained on the Wikitext-103 dataset. We preprocess the data by lowercasing all tokens across the splits and limiting the vocabulary to the top 100k words.
Ad-Hoc LEXSUB Model. The distributional component of our ad-hoc model is a two-layer QRNN-based language model with a 300-dimensional embedding layer and a 1,200-dimensional hidden layer. The batch size, BPTT length, and dropout ratio for our model are 30, 140, and 0.1, respectively. We train our model for 10 epochs using the Adam (Kingma and Ba, 2014) optimizer with an initial learning rate of 0.001, which is reduced during training by a factor of 10 at epochs 3, 6, and 7. We use the same set of hyperparameters that were used for the post-hoc experiments.

Table 10: Intrinsic and extrinsic experiment results for baselines and LEXSUB trained with the lexical resource from LEAR. We observe a similar trend in the intrinsic and extrinsic evaluations as when the models were trained on the lexical resources from Section 4.2. This indicates that LEXSUB's stronger performance is due to our novel subspace-based formulation rather than to an ability to better exploit a specific lexical resource.
Results. Table 9c presents the extrinsic evaluations of the ad-hoc LEXSUB model. Vanilla, in this case, refers to embeddings from the language model trained on Wikitext-103 without any lexical constraints. We observe that ad-hoc LEXSUB outperforms Vanilla on all extrinsic tasks, demonstrating that learning lexical relations in subspaces is also helpful in the ad-hoc setting.
We observe similar gains for ad-hoc LEXSUB on intrinsic evaluation in Table 9a and 9b.

Appendix B: Experiments with Lexical Resource from LEAR

In Section 7, we discussed the performance of LEXSUB and the baselines trained on the lexical resource presented in Section 4.2. In this section, we repeat the same set of experiments, but with LEXSUB and the baselines trained on the lexical resource from LEAR, our strongest competitor. The objective of these experiments is to ascertain that LEXSUB's competitive advantage is due to our novel subspace-based formulation rather than to an ability to better exploit the lexical resource discussed in Section 4.2.
The hyperparameters used to train the models are the same as in Section 4.3. For the baselines, we use the hyperparameters reported in the respective papers. We observe a similar trend in the intrinsic and extrinsic evaluations: LEXSUB outperforms all the baselines on relatedness (Table 10a), on the hypernymy intrinsic tasks (Table 10b), and on all the extrinsic tasks (Table 10c). We again observe that LEAR and Counterfitting perform poorly on the relatedness tasks. We suspect the poor relatedness scores of LEAR and Counterfitting arise because these models distort the original distributional space.