Knowledge-Based Systems

Volume 152, 15 July 2018, Pages 107-116

Phrase embedding learning based on external and internal context with compositionality constraint

https://doi.org/10.1016/j.knosys.2018.04.009

Abstract

Different methods have been proposed to learn phrase embeddings, and they can be mainly divided into two strands. The first strand is based on the distributional hypothesis: it treats a phrase as one non-divisible unit and learns the phrase embedding from its external context, in the same way as word embeddings are learned. However, distributional methods cannot make use of the information embedded in component words, and they also face the data sparseness problem. The second strand is based on the principle of compositionality and infers a phrase embedding from the embeddings of its component words. Compositional methods give erroneous results if a phrase is non-compositional. In this paper, we propose a hybrid method that linearly combines a distributional component and a compositional component under an individualized phrase compositionality constraint. The phrase compositionality is computed automatically from the distributional embeddings of the phrase and its component words. Evaluations on five phrase-level semantic tasks show that our proposed method has the best overall performance. Most importantly, our method is more robust as it is less sensitive to datasets.

Introduction

Phrases, as one kind of language unit, play an important role in many NLP applications such as machine translation, web search and sentiment analysis [1]. Generally speaking, phrases can be categorized as either compositional or non-compositional. For compositional phrases, such as traffic light and swimming pool, their semantics are composed from the semantics of their component words. We define the component words as the internal context of a phrase. For non-compositional phrases, such as the multiword expressions couch potato and kick the bucket, their semantics are generally not directly related to the semantics of their component words. According to [2], in a corpus of web pages, about 15% of word tokens belong to multiword expressions, and 57% of sentences and 88% of documents contain at least one multiword expression.

With the success of word embedding, which represents a word as a latent low-dimensional vector [3], embedding representations have been proposed for other areas such as network embedding [4] and user embedding [5]. Different models have also been proposed to learn phrase embedding, following two main approaches.

The first is the distributional approach, which is developed from the distributional hypothesis that words occurring in similar contexts tend to have similar meanings [6]. This kind of context is referred to as the external context, i.e., the words surrounding a phrase. We use the term distributional embedding to refer to embeddings obtained by the distributional approach. Methods based on this approach treat a phrase as one single unit and learn its embedding in the same way as word embeddings are learned [7], [8], [9]. However, distributional embedding suffers from the data sparseness problem. This is because distributional methods rely on the contexts of a target unit: for units with a low frequency of occurrence, there is an insufficient number of unit-context pairs, and this problem is more serious at the phrase level than at the word level. Moreover, for phrases that are indeed compositional, the semantic information contained in the component words is totally ignored. For example, both traffic and light are frequently used words and their embeddings can be very useful in forming the meaning of the phrase traffic light, but distributional methods do not make use of such information.
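As a rough illustration of this approach (not the paper's exact pipeline), the sketch below merges known phrases into single tokens and then trains a skip-gram model over the merged corpus, so each phrase receives its own distributional embedding. It assumes gensim 4.x (Word2Vec with the vector_size parameter); the phrase list and corpus are toy examples.

# Minimal sketch of the distributional approach: known phrases are merged into
# single tokens so that skip-gram treats each phrase as a non-divisible unit.
from gensim.models import Word2Vec

phrases = {("traffic", "light"), ("swimming", "pool"), ("couch", "potato")}

def merge_phrases(tokens, phrases):
    """Replace each known bigram phrase with one underscore-joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

corpus = [
    "the traffic light near the swimming pool turned red".split(),
    "he became a couch potato after retirement".split(),
]
corpus = [merge_phrases(sent, phrases) for sent in corpus]

# Skip-gram over the merged corpus: the phrase token traffic_light now gets
# its own embedding learned purely from its external context words.
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1)
print(model.wv["traffic_light"])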

The second approach, referred to as the compositional approach, is based on the principle of compositionality [10], which states that the meaning of an expression is composed from the meanings of its constituents and its internal structure. We use compositional embedding to refer to embeddings obtained by the compositional approach. Methods of this kind compute a phrase embedding from the embeddings of the component words using some composition function [11], [12], [13], [14]. One problem with this approach is that the embeddings learned for non-compositional phrases are incorrect, so the approach fails for such phrases. For example, the meaning of the phrase monkey business is not related to the meanings of monkey and business, so any composition function applied to the embeddings of the component words will lead to erroneous results.
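To make the idea concrete, the following sketch shows three common composition functions (additive, element-wise multiplicative, and weighted additive) applied to component word embeddings. The vectors and the weights alpha and beta are illustrative only and are not taken from the paper.

# Common composition functions over component word embeddings,
# following the principle of compositionality.
import numpy as np

def additive(u, v):
    return u + v

def multiplicative(u, v):
    return u * v

def weighted_additive(u, v, alpha=0.6, beta=0.4):
    return alpha * u + beta * v

u = np.array([0.2, 0.7, -0.1])   # toy embedding of "traffic"
v = np.array([0.5, -0.3, 0.4])   # toy embedding of "light"

print(additive(u, v))            # composed embedding of "traffic light"
print(weighted_additive(u, v))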

We argue that both the internal context and the external context are useful for inferring phrase embeddings, and that the usefulness of the internal context depends on the compositionality of the phrase. If a phrase is compositional, both the internal and the external context should be used so that all the available information contributes to its representation. If a phrase is non-compositional, the representations of its component words will not be useful and the phrase representation should be inferred from its external context only. The issue is that the choice of approach depends on the proportion of compositional phrases in the dataset, and this information is not a priori knowledge available to applications.

Based on the above analysis, we propose a hybrid model that linearly combines a distributional component and a compositional component under an individualized compositionality constraint. Compositionality is a value indicating to what extent the semantics of a phrase can be inferred from those of its component words: the more compositional a phrase is, the larger its compositionality value. For a non-compositional phrase, the compositionality should be low, so in the hybrid model its semantics are mainly determined by its external context through the distributional component. For a compositional phrase, the compositionality should be high, so the distributional and compositional components are used together. The hybrid model is thus designed to overcome the drawbacks of both the distributional and the compositional approach. The key for the hybrid model to work is learning an appropriate compositionality for each phrase; assigning a constant value to all phrases obviously does not suffice. In this work, we use two methods to learn the compositionality of each phrase from measures between the distributional embeddings of the phrase and its component words.
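The following is a minimal sketch of this gating idea under our own simplifying assumptions: the compositionality t is estimated as the (clamped) cosine similarity between the distributional phrase embedding and the additive composition of its component word embeddings, and lam plays the role of the global combination weight. The exact formulation used in the paper may differ.

# Hedged sketch of a hybrid phrase embedding: the distributional and the
# composed embeddings are linearly combined, gated by a phrase-specific
# compositionality t estimated from their cosine similarity.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_phrase_embedding(dist_phrase, word_vecs, lam=0.5):
    composed = np.sum(word_vecs, axis=0)          # compositional component
    t = max(0.0, cosine(dist_phrase, composed))   # compositionality in [0, 1]
    return (1 - lam * t) * dist_phrase + lam * t * composed, t

dist_phrase = np.array([0.1, 0.8, -0.2])          # distributional embedding of the phrase
word_vecs = [np.array([0.2, 0.7, -0.1]),          # "traffic"
             np.array([0.5, -0.3, 0.4])]          # "light"
emb, t = hybrid_phrase_embedding(dist_phrase, word_vecs)
print(t, emb)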

To evaluate the performance of our proposed model, we applied the resulting phrase embeddings to different downstream tasks over five datasets. Evaluations show that our model has the best overall performance. More importantly, our model is the most robust as it is less sensitive to the choice of dataset than the baseline methods.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 presents our proposed hybrid model. Section 4 gives the performance evaluation, and Section 5 concludes this paper.

Section snippets

Embedding representation

Representing objects in a latent space has a long history; for example, Latent Semantic Analysis represents a document as a latent vector [15]. Word embedding, as one kind of latent representation, represents a word as a low-dimensional and dense vector that encodes semantic information. Methods for learning word embeddings can be either count-based or prediction-based [16]. Count-based methods first build a word-context statistics matrix where each entry in the matrix can be co-occurrence
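As a rough illustration of the count-based route, the sketch below builds a windowed word-context co-occurrence matrix from a toy corpus; in practice such raw counts are usually re-weighted (e.g., with PPMI) and factorized to obtain low-dimensional vectors. The corpus and window size are illustrative.

# Build raw word-context co-occurrence counts within a fixed window.
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    counts = defaultdict(int)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(w, tokens[j])] += 1
    return counts

sentences = ["the traffic light turned red".split(),
             "the swimming pool was closed".split()]
print(cooccurrence_counts(sentences)[("traffic", "light")])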

Proposed framework

For a given phrase, our proposed model, shown in Fig. 1, consists of two parts: a distributional component based on the distributional hypothesis and a compositional component based on the principle of compositionality. The two parts are linearly combined with a fixed weight λ and a phrase-specific compositionality weight t. λ is a hyper-parameter controlling the overall contribution of each component. The compositionality t is a value ranging from 0 to 1, where 0 indicates that the
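One illustrative way to write such a combination (an assumption on our part, not necessarily the paper's exact formulation) is

\[
\mathbf{v}_p = (1 - \lambda t_p)\,\mathbf{v}_p^{\mathrm{dist}} + \lambda t_p\, f(\mathbf{v}_{w_1}, \ldots, \mathbf{v}_{w_n}), \qquad t_p \in [0, 1],
\]

where \(\mathbf{v}_p^{\mathrm{dist}}\) is the distributional embedding of the phrase, \(f(\cdot)\) is a composition function over the component word embeddings \(\mathbf{v}_{w_1}, \ldots, \mathbf{v}_{w_n}\), \(\lambda\) is the global hyper-parameter, and \(t_p\) is the phrase-specific compositionality.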

Experiment

In this section, we evaluate the representations produced by the proposed phrase embedding learning model on five different phrase-level semantic tasks covering both English and Chinese. For all experiments on English text, the Wikipedia August 2016 dump is used as the training corpus. In pre-processing, pure digits and punctuation marks are removed and all English words are converted to lowercase. The final corpus consists of about 3.2 billion words. During
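A small sketch of the described pre-processing step follows; the exact tokenization pipeline is not specified here, so the regular expressions below are an assumption.

# Drop tokens that are pure digits or pure punctuation and lowercase the rest.
import re

def preprocess(line):
    kept = []
    for tok in line.split():
        if re.fullmatch(r"\d+", tok):        # pure digits
            continue
        if re.fullmatch(r"[^\w\s]+", tok):   # pure punctuation
            continue
        kept.append(tok.lower())
    return kept

print(preprocess("The Traffic Light , built in 1914 , still works ."))
# -> ['the', 'traffic', 'light', 'built', 'in', 'still', 'works']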

Conclusion and future work

In this paper, a hybrid model, D&C, is proposed to learn phrase representations from both their external contexts and internal contexts through a weighted linear combination with a phrase-specific constraint. Instead of a simple combination of the two kinds of information, individualized compositionality measures from lexical semantics serve as the constraint. Evaluations on five phrase semantic analysis tasks show that the proposed hybrid model performs better than other

References (53)

  • A. Moreno-Ortiz et al.

    Managing multiword expressions in a lexicon-based sentiment analysis system for Spanish

    Proceedings of the Ninth Workshop on Multiword Expressions, MWE@NAACL-HLT

    (2013)
  • N. Schneider et al.

Comprehensive annotation of multiword expressions in a social web corpus

    Proceedings of the Ninth International Conference on Language Resources and Evaluation

    (2014)
  • T. Mikolov et al.

    Distributed representations of words and phrases and their compositionality

    Proceedings of Advances in Neural Information Processing Systems

    (2013)
  • A. Grover et al.

    node2vec: scalable feature learning for networks

    Proceedings of the Twenty Second ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

    (2016)
  • Y. Yu et al.

    User embedding for scholarly microblog recommendation

    Proceedings of the Fifty Fourth Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

    (2016)
  • Z.S. Harris

    Distributional structure

    Word

    (1954)
  • T. Mikolov et al.

    Efficient estimation of word representations in vector space

    CoRR

    (2013)
  • W. Yin et al.

    An exploration of embeddings for generalized phrases

    Proceedings of the ACL, Student Research Workshop

    (2014)
  • W. Yin et al.

    Discriminative phrase embedding for paraphrase identification

    Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

    (2015)
  • G. Frege

    The foundations of arithmetic

  • J. Mitchell et al.

    Composition in distributional models of semantics

    Cognit. Sci.

    (2010)
  • M. Yu et al.

    Learning composition models for phrase embeddings

    Trans. Assoc. Comput. Linguist.

    (2015)
  • R. Socher et al.

    Reasoning with neural tensor networks for knowledge base completion

    Proceedings of the Advances in Neural Information Processing Systems

    (2013)
  • S. Wang et al.

    Comparison study on critical components in composition model for phrase representation

    ACM Trans. Asian Low-Resour. Lang. Inf. Process.

    (2017)
  • D.M. Blei

    Build, compute, critique, repeat: data analysis with latent variable models

    Annu. Rev. Stat. Appl.

    (2014)
  • M. Baroni et al.

    Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors

    Proceedings of the Fifty Second Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    (2014)
  • O. Levy et al.

    Improving distributional similarity with lessons learned from word embeddings

    Trans. Assoc. Comput. Linguist.

    (2015)
  • O. Levy et al.

    Neural word embedding as implicit matrix factorization

    Proceedings of the Advances in Neural Information Processing Systems

    (2014)
  • O. Levy et al.

Dependency-based word embeddings

    Proceedings of the Fifty Second Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

    (2014)
  • M. Faruqui et al.

    Improving vector space word representations using multilingual correlation

    Proceedings of the Fourteenth Conference of the European Chapter of the Association for Computational Linguistics

    (2014)
  • Z. Wang et al.

    Knowledge graph and text jointly embedding

    Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)

    (2014)
  • M. Faruqui

    Diverse context for learning word representations

    (2016)
  • O. Melamud et al.

    The role of context types and dimensionality in learning word embeddings

    Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

    (2016)
  • K.A. Nguyen et al.

    Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction

    Proceedings of the Fifty Fourth Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

    (2016)
  • V. Shwartz et al.

    Improving hypernymy detection with an integrated path-based and distributional method

    Proceedings of the Fifty Fourth Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    (2016)
  • F. Sun et al.

Inside out: two jointly predictive models for word representations and phrase representations

    Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence

    (2016)
Cited by (14)

    • TransPhrase: A new method for generating phrase embedding from word embedding in Chinese

      2021, Expert Systems with Applications
      Citation excerpt:

      (Huang, Ji, Yao, Huang, & Chen, 2016) proposed to combine the phrase vector calculated by the distributional method with the phrase vector calculated by addition and multiplication to obtain a new phrase representation. (Li, Lu, Xiong, & Long, 2018) proposed three hybrid methods which automatically adjust the ratio of the distributional method to the compositional method, and which perform best at present. Since 2018, pre-trained language models have become the most popular text representations and have brought great performance improvements to natural language processing tasks.

    • Phrase embedding learning from internal and external information based on autoencoder

      2021, Information Processing and Management
      Citation excerpt:

      In the Chinese experiment, we use the DSG method (Song et al., 2018). D&C-C: (Li et al., 2018) proposed three kinds of phrase representation methods that can effectively combine distributed phrase vectors and constituent word vectors, which are currently excellent phrase representation methods. The D&C-C method is the first of them.

    • A novel community answer matching approach based on phrase fusion heterogeneous information network

      2021, Information Processing and Management
      Citation excerpt:

      Pennington, Socher and Manning (2014) utilized a global log-bilinear regression model to combine comprehensive matrix factorization and local context window for phrase embedding. Li, Lu, Xiong and Long (2018) represented the phrase embedding through a linear combination of distributed components. Stein, Jaques and Valiati (2019) investigated the application of phrase embeddings on automatic document classification tasks.
