Phrase embedding learning based on external and internal context with compositionality constraint
Introduction
Phrases, as one kind of language unit, play an important role in many NLP applications such as machine translation, web search and sentiment analysis [1]. Generally speaking, phrases can be categorized as either compositional or non-compositional. For compositional phrases, such as traffic light and swimming pool, the semantics are composed from the semantics of their component words. We define the component words as the internal context of a phrase. For non-compositional phrases, such as the multiword expressions couch potato and kick the bucket, the semantics are generally not directly related to the semantics of the component words. According to [2], in a corpus of web pages, about 15% of word tokens belong to multiword expressions, and 57% of sentences and 88% of documents contain at least one multiword expression.
With the success of word embedding, which represents a word as a latent low-dimensional vector [3], embedding representations have been proposed for other areas, such as network embedding [4] and user embedding [5]. Different models have also been proposed to learn phrase embeddings, following two main approaches.
The first is the distributional approach, developed from the distributional hypothesis that words occurring in similar contexts tend to have similar meanings [6]. This kind of context is referred to as the external context, i.e., the words surrounding a phrase. We use the term distributional embedding for embeddings obtained by the distributional approach. Methods based on this approach treat a phrase as a single unit and learn its embedding the same way as word embeddings [7], [8], [9]. However, distributional embedding suffers from the data sparsity problem. This is because distributional methods rely on the contexts of a target unit: for units with a lower frequency of occurrence, there is an insufficient number of unit-context pairs. Data sparsity is more serious at the phrase level than at the word level. Moreover, for phrases that are indeed compositional, the semantic information contained in the component words is totally ignored. For example, both traffic and light are frequently used words and their embeddings can be very useful in forming the meaning of the phrase traffic light, but distributional methods do not make use of such information.
The second approach, referred to as the compositional approach, is based on the principle of compositionality [10]: the meaning of an expression is composed from the meanings of its constituents and its internal structure. We use compositional embedding for embeddings obtained by this approach. Such methods compute a phrase embedding from the embeddings of the component words via some composition function [11], [12], [13], [14]. One problem with this approach is that the embeddings learned for non-compositional phrases are incorrect; the approach simply fails for such phrases. For example, the meaning of the phrase monkey business is not related to the meanings of monkey and business, so any composition function based on the embeddings of the component words will lead to erroneous results.
We argue that both the internal contexts and external contexts are useful for inferring phrase embeddings. The usefulness of internal contexts depends on the compositionality of the phrase. If a phrase is compositional, both the internal and external contexts should be used to take advantage of all the information available for its representation. If a phrase is non-compositional, the representations of its component words will not be useful, and the phrase representation should be inferred from its external contexts only. The issue is that the choice of which approach to use depends on the proportion of compositional phrases in the dataset. This information, however, is not known a priori to applications.
Based on the above analysis, we propose a hybrid model that linearly combines a distributional component and a compositional component with an individualized compositionality constraint. Compositionality is a value indicating to what extent the semantics of a phrase can be inferred from that of its component words. The more compositional a phrase is, the larger its compositionality value. For a non-compositional phrase, the compositionality should be low; in the hybrid model, its semantics is then determined mainly by its external contexts through the distributional component. For a compositional phrase, the compositionality should be high, and the distributional and compositional components are used together. The hybrid model is designed to overcome the drawbacks of both the distributional and the compositional approach. The key to making the hybrid model work is learning an appropriate compositionality for each phrase; a constant value for all phrases obviously will not do the trick. In this work, we use two methods to learn the compositionality of each phrase, using similarity measures between the distributional embeddings of a phrase and its component words.
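The kind of measure described above can be sketched concretely: treat compositionality as the rescaled cosine similarity between a phrase's distributional embedding and a vector composed from its component words. The function name and the additive composition are illustrative assumptions, not the paper's exact definition:

```python
import numpy as np

def compositionality(phrase_vec, word_vecs):
    """Estimate a compositionality score t in [0, 1] as the cosine
    similarity between a phrase's distributional embedding and the sum
    of its component word embeddings, rescaled from [-1, 1] to [0, 1].
    This is one plausible measure; the paper's exact measures may differ."""
    composed = np.sum(word_vecs, axis=0)
    cos = np.dot(phrase_vec, composed) / (
        np.linalg.norm(phrase_vec) * np.linalg.norm(composed))
    return (cos + 1.0) / 2.0
```

A phrase whose distributional vector points in the same direction as the sum of its word vectors gets t close to 1 (compositional); an unrelated or opposite vector gets t close to 0 (non-compositional).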
To evaluate the performance of our proposed model, we applied our phrase embedding results to downstream tasks using five datasets. Evaluations show that our model has the best overall performance. More importantly, our model is the most robust, as it is less sensitive to the choice of dataset than the baseline methods.
The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 presents our proposed hybrid model. Section 4 gives the performance evaluation, and Section 5 concludes the paper.
Embedding representation
Representing objects in a latent space has a long history; for example, Latent Semantic Analysis represents a document as a latent vector [15]. Word embedding, as one kind of latent representation, represents a word as a low-dimensional, dense vector that encodes semantic information. Methods for learning word embeddings can be either count-based or prediction-based [16]. Count-based methods first build a word-context statistics matrix, where each entry can be a co-occurrence count (or an association weight derived from it) between a word and a context.
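The count-based setup described here can be sketched in a few lines: build a word-context co-occurrence matrix with a sliding window, then re-weight it, for example with positive PMI. The window size and the PPMI weighting are illustrative choices, not a specific system's configuration:

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Build a symmetric word-context co-occurrence count matrix using a
    sliding window over tokenized sentences (a minimal count-based setup)."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if i != j:
                    M[idx[w], idx[s[j]]] += 1
    return M, vocab

def ppmi(M):
    """Re-weight raw counts with positive pointwise mutual information."""
    total = M.sum()
    row = M.sum(axis=1, keepdims=True)
    col = M.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((M * total) / (row * col))
    return np.maximum(pmi, 0)  # zero cells map to 0 rather than -inf
```

Each row of the PPMI matrix is a sparse distributional vector; prediction-based methods instead learn dense vectors directly, and the two families are closely related.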
Proposed framework
For a given phrase, our proposed model is shown in Fig. 1. It consists of two parts: the distributional component, based on the distributional hypothesis, and the compositional component, based on the principle of compositionality. The two parts are linearly combined with a fixed weight λ and a phrase-specific compositionality weight t. λ is a hyper-parameter controlling the overall contribution of each component. The compositionality t is a value ranging from 0 to 1, where 0 indicates that the phrase is non-compositional and 1 indicates that it is fully compositional.
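The combination step can be sketched as follows. The function names, the averaging composition and the exact gating formula (1 − λt)·dist + λt·comp are assumptions for illustration, since the snippet does not give the model's precise equation:

```python
import numpy as np

def hybrid_embedding(dist_vec, word_vecs, t, lam=0.5):
    """Sketch of the hybrid combination: the distributional embedding is
    linearly mixed with a composed embedding, gated by the phrase-specific
    compositionality t and the global hyper-parameter lam. Averaging is
    used as a placeholder composition function; the actual model may use
    a learned composition."""
    comp_vec = np.mean(word_vecs, axis=0)            # compositional component
    return (1 - lam * t) * dist_vec + lam * t * comp_vec
```

With t = 0 the result reduces to the purely distributional embedding, matching the intended behavior for non-compositional phrases; as t grows, the composed vector contributes more, up to the ceiling set by λ.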
Experiment
In this section, we evaluate the representations produced by the proposed phrase embedding learning model on five different phrase-level semantic tasks covering both English and Chinese. For all experiments on English text, the Wikipedia August 2016 dump is used as the training corpus. In pre-processing, pure-digit tokens and punctuation are removed and all English words are converted to lowercase. The final corpus consists of about 3.2 billion words.
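The English pre-processing described above can be sketched as follows; the tokenization regex is an assumption, since the paper does not specify its tokenizer:

```python
import re

def preprocess(line):
    # Lowercase the text, then keep only alphabetic tokens, which drops
    # pure-digit tokens and punctuation as described for the corpus.
    return re.findall(r"[a-z]+", line.lower())
```

For example, `preprocess("Traffic Light 2016, signals!")` yields `["traffic", "light", "signals"]`.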
Conclusion and future work
In this paper, a hybrid model, D&C, is proposed to learn phrase representations from both their external and internal contexts through a weighted linear combination with a phrase-specific constraint. Instead of a simple combination of the two kinds of information, individualized compositionality measures from lexical semantics serve as the constraint. Evaluations on five phrase semantic analysis tasks show that the proposed hybrid model performs better than the baseline methods.
References (53)
- et al., Managing multiword expressions in a lexicon-based sentiment analysis system for Spanish, Proceedings of the Ninth Workshop on Multiword Expressions, MWE@NAACL-HLT (2013)
- et al., Comprehensive annotation of multiword expressions in a social web corpus, Proceedings of the Ninth International Conference on Language Resources and Evaluation (2014)
- et al., Distributed representations of words and phrases and their compositionality, Proceedings of Advances in Neural Information Processing Systems (2013)
- et al., node2vec: scalable feature learning for networks, Proceedings of the Twenty-Second ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016)
- et al., User embedding for scholarly microblog recommendation, Proceedings of the Fifty-Fourth Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (2016)
- Distributional structure, Word (1954)
- et al., Efficient estimation of word representations in vector space, CoRR (2013)
- et al., An exploration of embeddings for generalized phrases, Proceedings of the ACL Student Research Workshop (2014)
- et al., Discriminative phrase embedding for paraphrase identification, Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2015)
- The foundations of arithmetic
- Composition in distributional models of semantics, Cognit. Sci.
- Learning composition models for phrase embeddings, Trans. Assoc. Comput. Linguist.
- Reasoning with neural tensor networks for knowledge base completion, Proceedings of the Advances in Neural Information Processing Systems
- Comparison study on critical components in composition model for phrase representation, ACM Trans. Asian Low-Resour. Lang. Inf. Process.
- Build, compute, critique, repeat: data analysis with latent variable models, Annu. Rev. Stat. Appl.
- Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors, Proceedings of the Fifty-Second Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Improving distributional similarity with lessons learned from word embeddings, Trans. Assoc. Comput. Linguist.
- Neural word embedding as implicit matrix factorization, Proceedings of the Advances in Neural Information Processing Systems
- Dependency-based word embeddings, Proceedings of the Fifty-Second Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
- Improving vector space word representations using multilingual correlation, Proceedings of the Fourteenth Conference of the European Chapter of the Association for Computational Linguistics
- Knowledge graph and text jointly embedding, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)
- Diverse context for learning word representations
- The role of context types and dimensionality in learning word embeddings, Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction, Proceedings of the Fifty-Fourth Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
- Improving hypernymy detection with an integrated path-based and distributional method, Proceedings of the Fifty-Fourth Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Inside out: two jointly predictive models for word representations and phrase representations, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence