A Proof-Theoretic Approach to Scope Ambiguity in Compositional Vector Space Models

We investigate the extent to which compositional vector space models can be used to account for scope ambiguity in quantified sentences (of the form "Every man loves some woman"). Such sentences, which contain two quantifiers, have two readings: a direct scope reading and an inverse scope reading. This ambiguity has been treated in a vector space model using bialgebras by Hedges and Sadrzadeh (2016) and Sadrzadeh (2016), though without an explanation of the mechanism by which the ambiguity arises. We combine a polarised focussed sequent calculus for the non-associative Lambek calculus NL, as described by Moortgat and Moot (2011), with the vector-based approach to quantifier scope ambiguity. In particular, we establish a procedure for obtaining a vector space model for quantifier scope ambiguity in a derivational way.


Introduction
There is a long-standing tradition of compositionality in formal semantics: to separate the meaning of basic elements (lexical semantics) from the construction of higher-level meaning (derivational semantics), one defines a homomorphism from a syntactic algebra to a semantic algebra. Having been rigorously formalised by Montague in his seminal papers [14,15], these ideas have been made concrete in the field of typelogical grammar, where syntactic types are mapped onto semantic types so that any derivation gives rise to a meaning recipe. Traditionally, meaning is taken to be a linear lambda term that evaluates to a truth value.
Ongoing research on distributional semantics, based on the idea that word meaning is defined relative to a word's context, has revealed an appealing way to incorporate typelogical grammar into distributional models [5]. This approach, also known as the DisCoCat approach (Distributional Compositional Categorical models), treats compositionality in the Montagovian style as a functorial passage from syntactic types and proofs to vectors and linear maps. Given that this line of research is still in its early phase, there is much to be done to formalise details of the model, give accounts of semantic phenomena, and evaluate the effectiveness of the chosen approach.
Though traditional categorial syntax and semantics go hand in hand, some aspects of set-theoretic formal semantics are lost in the switch to a vector space model of meaning. First, the interpretations of constants that one can appeal to in formal semantics are not directly available in a vector-based setting; a logical word like "not" can be computed in the formal setting by taking the set complement, but negating a vector or matrix is not trivial. Similarly, for coordinators like "and" and "or" the standard set intersection and union are not available in a vectorial setting. One could replace intersection by vector multiplication and union by vector summation, but in the presence of concrete distributional vectors it is not clear that such operations perform well in an experimental setting. Second, the DisCoCat approach assumes a tight categorical correspondence between a syntactic formalism and the concrete vector semantics: when we want to stay in the realm of finite dimensional vector spaces, we are dealing with a compact closed category; to model a categorial grammar as a category, one needs to fully explicate its proof-theoretic logical and structural rules, an exposition that is not trivially available for every categorial system. Another issue with this categorical treatment is that a simple vector-based model does not have the non-linearity that some models assume. As an example, allowing non-linearity in lexical lambda terms or as a syntactic mechanism means copying material, which is not possible for all vectors. We discuss this issue in more detail in the rest of the paper. Some of the above issues have been addressed in recent work by [20,8,19], giving accounts of subject/object relativisation, generalised quantifiers, and quantifier scope.
In [20], the meaning of pronoun relative clauses is explained by using Frobenius algebras in the lexicon, and by assigning different pregroup grammar types to the subject relative pronoun "who" and the object relative pronoun "whom". Two different derivations then naturally arise, giving an intersectional meaning to subject relative clauses like "Men who like Mary", and object relative clauses like "Men whom Mary likes". Such an approach does not lend itself to certain Germanic languages where the ambiguity has to be derivational: in Dutch, both the subject relative and object relative interpretations share the surface form "Mannen die Marie mogen". To deal with this without specifying lexical alternatives, i.e. different possible typings of the relative pronoun "die", [17] provide a derivational account that results in the same intersective vector space meaning as that of [20].
What is missing from the results obtained so far on quantifier scope ambiguity is a detailed discussion of the derivational process that gives rise to the ambiguity. Compared to pronoun relativisation, quantifier scope ambiguity is the more pressing case: the ambiguity exists in English and does not come from the lexicon, but rather from different ways of reading the same surface form. The account of [8] explores the use of bialgebras to represent quantifiers, using context free grammars as the syntactic engine; its follow-up [19] discusses scope ambiguity but assumes the ambiguity to be given before detailing the direct scope and inverse scope readings of phrases of the shape "Every man loves some woman". In order to explain how the ambiguity comes about, we need to detail the syntactic process, and integrate it with a vector-based semantics.
Our goal in this paper, then, is to pave the way to fully explain compositionality in vector space models of meaning while also taking into account the desirable mechanisms of e.g. Frobenius algebras and bialgebras. Our contribution is to show how we can represent quantifier scope ambiguity in a derivational manner, fully determined by the syntactic process combined with a suitable lexical semantics.
We will make use of a polarised non-associative Lambek calculus, and use focussing as a technique to gain control over the space of sequent derivations. A continuation-passing-style translation from syntactic types into semantic objects then gives rise to the expected reading for quantifier scope ambiguity. This technique has been worked out by [16] (following [3] and [2]), but has not until now been put in the context of vector space models.
This paper is structured as follows: in Section 2, we will briefly discuss quantifier scope ambiguity and its apparent non-linearity. Next, in Section 3 we define the basic, compositional DisCoCat model. We proceed to review quantifier scope ambiguity in vector space models in Section 4, and show in Section 5 how we can derive quantifier scope ambiguity in a compositional way using a polarised focussed sequent calculus that is interpreted in a vector space model. We conclude in Section 6 by explaining how our results can be further expanded and by introducing some potential new areas of investigation.

Quantifier Scope Ambiguity
There seems to be an intrinsic non-linearity associated with quantifiers. Consider the word "all" in a phrase "all men sleep". One way of modelling the universal quantification in the phrase is to let "all" refer to an operation that decides whether the set of "men" is a subset of those entities that are sleeping, i.e. if "men" refers to some set A, and "sleep" to some set B, then "all men sleep" computes whether A ⊆ B. This can be given an alternative definition:
$$A \subseteq B \iff A = A \cap B$$
When one tries to give this interpretation in terms of a λ-term, the usual approach is to model both "men" and "sleep" as the characteristic function of a set of entities, where "all" is given a non-linear λ-term:
$$\lambda Q.\lambda P.\forall x.(Q(x) \rightarrow P(x))$$
This λ-term will effectively decide whether A ⊆ B, or alternatively whether A = A ∩ B. Both models sacrifice linearity in a sense: where the first, relational interpretation needs to use A both as an operand to the intersection operation and as an argument to decide equality, the second interpretation has to copy the variable x to decide whether everything in the universe satisfying the property Q also satisfies P. We argue that this required non-linearity, which is introduced by allowing non-linear λ-terms to be inserted through the lexicon, is exactly the same kind of non-linearity that is introduced to vector space models by means of bialgebra operations. It has been argued before that modelling quantification in vector space models forces one to use non-linear maps [7]. However, this issue has been partially resolved by [8] when one admits a powerset structure to the basis vectors of the model. The bialgebra operations thus obtained are linear in the algebraic sense, but non-linear in terms of typing information. That is, they allow for copying a resource X into a resource X ⊗ X and deleting a resource in the opposite direction.
That this kind of operation would jeopardise a Lambek-style grammar formalism is immediate, as the bialgebra operations would correspond to contraction and expansion, respectively. Our argument will proceed by claiming that a continuation-passing-style translation that allows for lexical insertion of non-linear λ-terms can instead be interpreted by means of the bialgebra operations of [8].
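The two ways of modelling "all" discussed above can be compared on a toy universe. The following is a minimal sketch only: the entities and the names `all_relational`, `all_lambda`, and `characteristic` are hypothetical and chosen for illustration. It shows where each reading uses a resource twice: the relational version mentions the set A two times, and the λ-term version applies both predicates to the same variable x.

```python
from typing import Callable, Set

Entity = str
Predicate = Callable[[Entity], bool]


def all_relational(a: Set[Entity], b: Set[Entity]) -> bool:
    """Relational reading: 'all A B' holds iff A = A ∩ B (A is used twice)."""
    return a == (a & b)


def characteristic(s: Set[Entity]) -> Predicate:
    """Turn a set into its characteristic function."""
    return lambda x: x in s


def all_lambda(universe: Set[Entity]) -> Callable[[Predicate], Callable[[Predicate], bool]]:
    """λQ.λP.∀x.(Q(x) → P(x)); the bound variable x is used twice."""
    return lambda q: lambda p: all((not q(x)) or p(x) for x in universe)


# Hypothetical toy model.
universe = {"john", "bill", "mary", "rex"}
men = {"john", "bill"}
sleepers = {"john", "bill", "rex"}
light_sleepers = {"john"}

every = all_lambda(universe)
print(all_relational(men, sleepers), every(characteristic(men))(characteristic(sleepers)))
print(all_relational(men, light_sleepers), every(characteristic(men))(characteristic(light_sleepers)))
```

On this model both formulations agree, as expected from the equivalence between A ⊆ B and A = A ∩ B.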

Compositional Distributional Semantics
Compositional distributional semantics in a categorical setting takes a mathematically rigorous approach to compositionality. Much like traditional Montagovian semantics, there is a syntactic algebra involved that provides grammaticality by means of a proof system; in this case it can be either pregroup grammar or the Lambek calculus [12,13]. The semantic algebra is, in the basic setup, the category of finite dimensional vector spaces, denoted FVect: each content word is assigned a vector that represents its position in the space of word meanings, obtained through some method of co-occurrence extraction on a corpus. Whenever a sequence of words, annotated with their syntactic types, leads to a derivation that proves grammaticality, the proof term associated with that derivation provides a linear map on the vectors associated with basic words, which after evaluation gives us the phrase meaning of that sequence of words.

Lambek grammars
We make the model sketched above concrete by giving the relevant definitions.
These are based on work by [23] in combination with the work of [4].

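Definition 1 (Lambek types)
Given a set T of basic types, the set of Lambek types F(T) is the smallest set such that:
1. T ⊆ F(T);
2. if A, B ∈ F(T), then A ⊗ B, A/B, and A\B are in F(T).

We proceed to define a Lambek calculus in terms of a labelled deductive system, i.e. we use the notation of an inference system to show how proofs are derived:

Definition 2 (Non-associative Lambek calculus) The (non-associative, non-unitary) Lambek calculus NL over T is given by the types in F(T) and the proofs generated by the following (labelled) inference system:
$$1_A : A \to A \qquad \frac{f : A \to B \quad g : B \to C}{g \circ f : A \to C}$$
$$\frac{f : A \otimes B \to C}{\triangleleft f : A \to C/B} \qquad \frac{g : A \to C/B}{\triangleleft^{-1} g : A \otimes B \to C}$$
$$\frac{f : A \otimes B \to C}{\triangleright f : B \to A\backslash C} \qquad \frac{g : B \to A\backslash C}{\triangleright^{-1} g : A \otimes B \to C}$$

One can show that monotonicity laws for each of the connectives are derived rules of inference: from proofs f : A → B and g : C → D one obtains proofs of A ⊗ C → B ⊗ D, A/D → B/C, and D\A → C\B.

Leaving aside the issue of global associativity and its desirability from a linguistic perspective, we note that it can be added using two additional axioms, one for each direction of rebracketing between (A ⊗ B) ⊗ C and A ⊗ (B ⊗ C).

The categorical version of the Lambek calculus can be obtained by imposing the relevant standard equivalences on proofs, amongst others stipulating that composing with the identity proof is a vacuous operation, and that all two-way inference rules are isomorphisms. For more detail we refer the reader to [23].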
To make grammaticality judgments about sequences of words, we need a lexicon assigning types to words over an alphabet. For the sake of completeness we define the lexicon as a relation, but in the remainder of this paper we will freely abuse notation and treat the lexicon as if it were a function.

Definition 3 (Lexicon) Let Σ be a finite, non-empty set of words (an alphabet). A lexicon over a set of basic types T is a relation δ ⊆ Σ × F(T).
Definition 4 (Lambek grammar) Given a set of basic types T , a Lambek grammar over T is a triple (Σ, δ, S) where Σ is an alphabet, δ is a lexicon over T , and S ∈ F (T ) is a distinguished goal type.
Definition 5 (Grammaticality) Given a Lambek grammar (Σ, δ, S) over T, we say that a sequence of words w_1 ... w_n over Σ is grammatical iff there is a merged sequence W_1 ⊗ W_2 ⊗ ... ⊗ W_n (where for each i we have w_i δ W_i), and there exists a proof of W_1 ⊗ W_2 ⊗ ... ⊗ W_n → S in the Lambek calculus.
The definitions presented so far give a procedure to obtain a proof of sentencehood for a sequence of words. Moreover, there might be several proofs for the same sequence of words. This may be desirable (in cases of derivational ambiguity) or not (in the case of proof-theoretic redundancy, e.g. the successive to and fro use of two-way rules). In the categorical variant of the Lambek calculus, we can simply take the proofs of sentencehood of a sequence to be the hom-set of morphisms Hom(W_1 ⊗ W_2 ⊗ ... ⊗ W_n, S). This produces fewer proofs, as unnecessary ambiguity of the proof system is removed by the categorical equations. The structure of the (non-associative) Lambek calculus NL is that of a biclosed magmatic category.
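To make the role of bracketing in the grammaticality definition concrete, here is a minimal sketch of a checker. It is not a full proof search for NL: it only covers sequents derivable by the two application schemas A/B ⊗ B → A and B ⊗ B\A → A, which follow from the residuation rules. The tuple encoding of types and the name `reduce_tree` are assumptions made for this illustration.

```python
# Types: atoms are strings; ("/", A, B) encodes A/B; ("\\", B, A) encodes B\A.
# Bracketed antecedents: ("*", left, right) encodes left ⊗ right.

def reduce_tree(tree):
    """Return the type a bracketed tree reduces to using application only,
    or None if no application step fits. Not a complete NL prover."""
    if not (isinstance(tree, tuple) and tree[0] == "*"):
        return tree  # a leaf is already a type
    left = reduce_tree(tree[1])
    right = reduce_tree(tree[2])
    if left is None or right is None:
        return None
    # A/B ⊗ B → A
    if isinstance(left, tuple) and left[0] == "/" and left[2] == right:
        return left[1]
    # B ⊗ B\A → A
    if isinstance(right, tuple) and right[0] == "\\" and right[1] == left:
        return right[2]
    return None


NP, S = "np", "s"
TV = ("/", ("\\", NP, S), NP)  # (np\s)/np, a transitive verb type

# np ⊗ ((np\s)/np ⊗ np): the verb first combines with its object.
right_branching = ("*", NP, ("*", TV, NP))
# (np ⊗ (np\s)/np) ⊗ np: without associativity this bracketing does not reduce.
left_branching = ("*", ("*", NP, TV), NP)

print(reduce_tree(right_branching))
print(reduce_tree(left_branching))
```

The second bracketing fails because the calculus is non-associative: the subject cannot combine with the verb before the verb has consumed its object.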

Finite dimensional vector space models
Lambek grammars are easily interpretable in vector space semantics as vector spaces enjoy compact closure, a weaker variant of the bi-closure of the Lambek calculus. We define the category FVect and show that it enjoys compact closure:

Definition 6 (Compact Closure) A compact closed category is a monoidal category C with dual objects A^l, A^r for every object A in C and additional morphisms
$$\epsilon^l_A : A^l \otimes A \to I \qquad \epsilon^r_A : A \otimes A^r \to I \qquad \eta^l_A : I \to A \otimes A^l \qquad \eta^r_A : I \to A^r \otimes A$$
that satisfy the yanking equations.

In the category of finite dimensional vector spaces FVect we have that the dual space A* is isomorphic to A when we fix a basis (which is the case for concrete models). The ε and η maps, now reduced to just two maps, are given by
$$\epsilon_A : A \otimes A \to \mathbb{R}, \quad \sum_{ij} c_{ij}\, \vec{a}_i \otimes \vec{a}_j \mapsto \sum_{ij} c_{ij} \langle \vec{a}_i | \vec{a}_j \rangle \qquad \eta_A : \mathbb{R} \to A \otimes A, \quad 1 \mapsto \sum_i \vec{a}_i \otimes \vec{a}_i$$
with the $\vec{a}_i$ ranging over the fixed basis of A.

In concrete vector models, we will have vectors learnt for content words. For instance, the noun phrases "John" and "Mary" can be interpreted as vectors $\vec{n}_1, \vec{n}_3 \in N$, respectively. This means that they are essentially single points in a vector space. Setting the sentence space to be the real numbers, a transitive verb like "loves" would live in the vector space N ⊗ R ⊗ N, and would carry information about the degree to which individuals love one another. In vector terms:
$$\overrightarrow{loves} = \sum_{ij} c_{ij}\, \vec{n}_i \otimes 1 \otimes \vec{n}_j$$
Here c_ij is the respective degree for any pair of individuals i, j. The meaning of the phrase "John loves Mary" should then reduce to taking the inner product of the noun phrases with the verb, and so should give
$$\sum_{ij} c_{ij} \langle \vec{n}_1 | \vec{n}_i \rangle \langle \vec{n}_j | \vec{n}_3 \rangle = c_{13}$$
In the next section, we show how to relate derivations in a Lambek grammar to concrete computations in a vector space model.

Interpretation
Given that the compact closedness of FVect instantiates the closure of the Lambek calculus, we can easily interpret proofs in a Lambek grammar in a vector space model by passing from words and their lexical types to vectors in a homomorphically obtained vector space. Any proof of grammaticality will be interpreted through the η and ε maps:

is the relational image of δ, such that I_0 respects typing and I_1 respects lexical type assignment. That is, and

An interpretation map sends words to vectors that respect the syntactic types associated with those words. We need to give a vectorial interpretation of proofs as well, in order to know how to compute meanings of a tuple of vectors. The identity proof and transitivity of proofs carry over to the identity map on vector spaces and the composition of linear maps. The remaining rules of residuation are interpreted as shown below:

It is a nice puzzle for the reader to verify that by the yanking equations, we preserve isomorphicity of residuation; for example, one can show that the interpretation of ⊲^{-1}⊲f is equal to the interpretation of f.

Illustration
Recall that we have vectors for "John", "Mary" and "loves" and we have an intended meaning of the phrase "John loves Mary". We take a Lambek grammar over the set of basic types {np, s}, where np will be interpreted as N and s will be mapped to R. We define a lexicon as follows: "John" and "Mary" are assigned the type np, and "loves" is assigned the type (np\s)/np. Given that ⊳^{-1}⊲^{-1}(1_{(np\s)/np}) proves grammaticality of "John loves Mary", the associated meaning computation will be
$$\sum_{ij} c_{ij} \langle \vec{n}_1 | \vec{n}_i \rangle \langle \vec{n}_j | \vec{n}_3 \rangle = c_{13}$$
This result is exactly the intended meaning we wanted to obtain. Note that the result of the computation relies on the fact that the content words in the vector space model are taken to be the basis vectors, hence they are orthogonal. The result c_13 indicates the distributional strength of John loving Mary in the corpus from which the vectors have been learnt.

Until now, we have neglected discussion about function words: logical words, relative pronouns, and quantifiers are not intuitively represented well by co-occurrence data. The logical word "and" may occur with many different words, but that statistic does not tell us much about the meaning of the word. So although all the basic operations from a Lambek grammar are directly interpretable in vector space models, more advanced semantic phenomena lack an explanation in the simple models.
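The computation for "John loves Mary" can be replayed numerically. The sketch below assumes a three-dimensional noun space with an orthonormal basis and a made-up coefficient matrix for "loves"; the numbers and the names `basis` and `sentence_meaning` are hypothetical. It contracts the subject and object vectors with the verb, which is what the two ε maps do when the sentence space is R.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def basis(i: int, dim: int) -> Vector:
    """The i-th basis vector (0-based) of a dim-dimensional space."""
    return [1.0 if k == i else 0.0 for k in range(dim)]


def sentence_meaning(subj: Vector, verb: Matrix, obj: Vector) -> float:
    """Compute sum_ij c_ij <subj|n_i> <n_j|obj> for a verb with coefficients c_ij."""
    return sum(
        subj[i] * verb[i][j] * obj[j]
        for i in range(len(subj))
        for j in range(len(obj))
    )


# Hypothetical degrees c_ij with which individual i loves individual j.
loves: Matrix = [
    [0.1, 0.2, 0.7],
    [0.4, 0.0, 0.3],
    [0.9, 0.5, 0.1],
]

john = basis(0, 3)  # n_1 in the paper's 1-based numbering
mary = basis(2, 3)  # n_3 in the paper's 1-based numbering

print(sentence_meaning(john, loves, mary))
```

Because the noun vectors are basis vectors, the double sum picks out a single coefficient, here the entry corresponding to c_13.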

Quantifier Scope Ambiguity in Vector Space Models
In this section we review the use of bialgebras in vector space models as exhibited by [8,19] and show how the two scope readings can be obtained. The treatment of quantifiers in vector space models relies on powersets. As long as we know that the basis vectors of our vector space are given by the powerset of some set A, we can perform additional operations on the vector space.
Definition 8 (Bialgebra) Given a symmetric monoidal category (C, ⊗, I, σ), a bialgebra on an object X in C is a tuple of maps
$$\mu : X \otimes X \to X \qquad \zeta : I \to X \qquad \delta : X \to X \otimes X \qquad \iota : X \to I$$
that satisfy the conditions of a monoid for (X, µ, ζ) and a comonoid for (X, δ, ι) and furthermore satisfy the bialgebra axioms:
$$\iota \circ \mu = \iota \otimes \iota \qquad \delta \circ \zeta = \zeta \otimes \zeta \qquad \iota \circ \zeta = 1_I \qquad \delta \circ \mu = (\mu \otimes \mu) \circ (1_X \otimes \sigma_{X,X} \otimes 1_X) \circ (\delta \otimes \delta)$$
The last of the four equations tells us that in a bialgebra, the order of copying and merging is irrelevant given that we can switch copies by means of the symmetry of the category. What is interesting to note is that any powerset P(U) bears a bialgebra structure if we consider the Cartesian product to be the tensor and the singleton set {⋆} as the identity object. It follows that any vector space over a powerset, denoted V_{P(U)}, carries a bialgebra structure. Both bialgebras are given below:
$$\delta(A) = (A, A) \qquad \iota(A) = \star \qquad \mu(A, B) = A \cap B \qquad \zeta(\star) = U$$
$$\delta|A\rangle = |A\rangle \otimes |A\rangle \qquad \iota|A\rangle = 1 \qquad \mu(|A\rangle \otimes |B\rangle) = |A \cap B\rangle \qquad \zeta(1) = |U\rangle$$
The existence of a bialgebra on powerset vector spaces allows for a neat treatment of quantification. Given that nouns and noun phrases are represented as vectors on a powerset, universal quantification and existential quantification are treated as

To get a feel for how the meaning of a quantified sentence should be computed according to [8], we show the example of "all men sleep", which gets assigned the meaning

Although this approach works for statements with a single quantifier, it fails to deliver both readings for a doubly quantified statement such as "every student likes some teacher", as the computations for the subject and object quantifiers will be independent of each other. Hence, both readings will collapse to the same meaning. This lack of explanatory power of the model is amended in a subsequent paper [19], where the implicit quantified variable is passed on to the computation of the second quantifier. For a transitive verb such as "likes", that is modelled as an element , we can model the forward image of an element in U as

The backward image is computed similarly by taking the inner product of $\vec{v}_a$ with $\vec{v}_j$.
This construction now allows for both readings of "every student likes some teacher", though there is no procedure given to obtain these readings through a syntactic process.
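The bialgebra structure on a powerset can be checked exhaustively on a small universe. The sketch below works at the level of basis elements (subsets of U) and assumes the usual powerset bialgebra, in which copying duplicates a subset, merging intersects two subsets, the unit is U itself, and deletion sends every subset to ⋆; the linear maps on V_{P(U)} would be the linear extensions of these. The function names and the toy sets are hypothetical.

```python
from itertools import combinations

U = frozenset({"a", "b", "c"})
STAR = "*"  # the single element of the identity object {⋆}


def powerset(s):
    """All subsets of s as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def copy(a):       # δ : A ↦ (A, A)
    return (a, a)


def delete(a):     # ι : A ↦ ⋆
    return STAR


def merge(a, b):   # µ : (A, B) ↦ A ∩ B
    return a & b


def unit():        # ζ : ⋆ ↦ U
    return U


def copy_merge_law(a, b):
    """Check δ∘µ = (µ⊗µ)∘(1⊗σ⊗1)∘(δ⊗δ) on the basis pair (a, b)."""
    lhs = copy(merge(a, b))
    (a1, a2), (b1, b2) = copy(a), copy(b)
    # The symmetry σ swaps the two middle components: (a1, a2, b1, b2) ↦ (a1, b1, a2, b2).
    rhs = (merge(a1, b1), merge(a2, b2))
    return lhs == rhs


subsets = powerset(U)
print(all(copy_merge_law(a, b) for a in subsets for b in subsets))
print(all(merge(a, unit()) == a for a in subsets))

# Relational reading of "all men sleep": copy "men", merge one copy with "sleep",
# and compare with the other copy.
men, sleep = frozenset({"a", "b"}), frozenset({"a", "b", "c"})
men_1, men_2 = copy(men)
print(merge(men_1, sleep) == men_2)
```

The last lines show why copying is needed: the subject set is consumed once by the merge and once by the comparison, which is the non-linearity discussed for the relational reading of "all".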

Quantifier Scope Ambiguity using Focussing and Polarisation
Focussing is a proof-theoretic technique stemming from the work of [1] that aims to eliminate redundancy from regular sequent systems. Focussed proof search proceeds by distinguishing those formulas that enjoy invertible introduction rules (asynchronous formulas), and those that do not (synchronous formulas). Asynchronous formulas are decomposed in a backward chaining proof search until no more decomposition is possible. Then, one of the synchronous formulas is selected to be put in focus, after which the process of decomposition continues. This implies that only the number of synchronous formulas determines the number of distinct proofs. This approach has been applied to the Lambek-Grishin calculus, a symmetric extension of the Lambek calculus, by [3], and is worked out in more detail by [16]. In order to obtain a compositional Montagovian semantics from a display-style presentation of focussed proofs for the Lambek-Grishin calculus, [2] applies a polarisation technique, whereby formulas are assigned either positive or negative polarity. Atomic formulas are assigned an arbitrary polarity; the choice of this bias affects the set of proofs obtained. The polarity also influences semantics: under the continuation semantics of [3], a negative formula will be negated in its interpretation. Though the focussing and polarisation approaches are described by [3] and [2], respectively, here we follow the focussed sequent presentation of [16]. We start by defining polarity of types:

Focused types are positive
Definition 9 (Polarity) Given a set of basic types T, a polarity assignment on types is a map pol : F(T) → {−, +} that assigns to the types in T an arbitrary polarity but fixes the polarity for complex types:
$$pol(A \otimes B) = + \qquad pol(A/B) = pol(B\backslash A) = -$$
Given a Lambek grammar G over a set T, grammaticality is defined similarly to Definition 5, where the set of proofs is given by the underlying proof system. The only difference is that the final sequent should have the consequent formula in focus. The proof is encoded by the abstract label of the proof, according to the abstract sequent system defined in Figure 1.
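A polarity assignment is straightforward to express as a recursive function on types. The following sketch assumes the usual convention for the Lambek connectives, namely that products are positive and both implications are negative, with the polarity of atoms supplied as a bias; the class names and the `pol` signature are hypothetical. The bias used in the example (np and n positive, s negative) is the one adopted later in the paper.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Tensor:          # A ⊗ B
    left: "Type"
    right: "Type"


@dataclass(frozen=True)
class Over:            # A/B: yields A when given a B on its right
    result: "Type"
    argument: "Type"


@dataclass(frozen=True)
class Under:           # B\A: yields A when given a B on its left
    argument: "Type"
    result: "Type"


Type = Union[Atom, Tensor, Over, Under]


def pol(t: Type, bias: Dict[str, str]) -> str:
    """Polarity of a type: atoms follow the bias, ⊗ is '+', / and \\ are '-'."""
    if isinstance(t, Atom):
        return bias[t.name]
    if isinstance(t, Tensor):
        return "+"
    return "-"


bias = {"np": "+", "n": "+", "s": "-"}
np_, n_, s_ = Atom("np"), Atom("n"), Atom("s")

determiner = Over(np_, n_)                      # np/n, e.g. "all"
transitive_verb = Over(Under(np_, s_), np_)     # (np\s)/np

print(pol(np_, bias), pol(s_, bias), pol(determiner, bias), pol(transitive_verb, bias))
```

Under this assignment, nouns and noun phrases are positive, so they are interpreted as vectors, whereas determiners and verbs are negative, so they are interpreted as linear maps.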

CPS translation
The translation of types and proofs given by [16] into a target semantic algebra is a two-step process. Instead of considering a proof to be a simple transformation of values (the assumptions) to a value (the conclusion), we consider a proof to be a continuation, a function that awaits an evaluation context to compute a final value. The intermediate semantics is the Lambek calculus with permutation and negation, LP_{⊗,⊥}, a system that only uses a product operation but introduces a negation. Furthermore, permutation of resources is allowed to compensate for the lack of directionality without the /, \ connectives. We will define a direct mapping from source to target, to skip the administrative details of the intermediate semantics.
In order to replicate the effect of the negation in LP_{⊗,⊥}, we use vector spaces over sets; given some type A, we define its interpretation to be a vector space over a set. In this way, we can make use of the bialgebras defined over those vector spaces. First, a type W is mapped to some set A, using the Cartesian product and powerset operations. Then, the final interpretation of a type will be the vector space over the given set, V_A. We get the intended tensor products on spaces due to the fact that
$$V_{A \times B} \cong V_A \otimes V_B$$
Definition 10 (Type interpretation) Given a set of basic types T and a basic interpretation map I_0 : T → Set, the type interpretation is a map I_1 : F(T) → Set defined as follows:
1. For basic types p ∈ T we have I_1(p) = I_0(p).
2. For complex types, the interpretation depends both on the polarity of subtypes and the connective involved:
3. We stipulate that for any type A, its interpretation I_1(A) is lifted to the vector space spanned by its elements, that is, we define the final interpretation I_2 : F(T) → FVect as I_2(W) = V_{I_1(W)}.

Definition 11 (Word interpretation) Given a Lambek grammar (Σ, δ, S) over a set of basic types T and an interpretation map I_2 : F(T) → I_2(δ(Σ)), where δ(Σ) is the relational image of Σ under the lexicon, and I_2(δ(Σ)) is its image under interpretation (i.e. vector spaces), the word interpretation is a map I_3 that respects the following:
I_3(w) ∈ I_2(W) iff w δ W and pol(W) = +
I_3(w) ∈ I_2(W) → R iff w δ W and pol(W) = −
That is, words with a positive type are translated as vectors, while words with a negative type are translated as linear maps.
As an example, if we define the associated vector space of the type np to be U and n to be P(U), then the interpretation of a noun like "student" will be a constant I_3("student") ∈ V_{P(U)}, whereas a word like "all" that is typed np/n will be a linear map

We proceed to define how we interpret proof terms. The intuitive idea is that a proof term is translated into a linear map that will subsequently be applied to the word interpretations of its antecedents. Though the proof system builds up terms with potentially unbound variables, we require for grammaticality (see above) that the conclusion formula be in focus; this means that the only unbound variables in the proof term are those of the antecedent formula, which will be substituted by word interpretations.
Definition 12 (Proof term interpretation) Given a proof in the focussed sequent calculus for NL, there is a proof term that encodes the proof. We define the interpretation of a proof by giving the translation of proof terms into linear maps:

Finally, as the translation is a continuation-passing-style translation, we will end up with a map that needs an evaluation context before finishing computation. So, given that a proof gives a linear map, we apply it to the identity map, and we instantiate the unbound variables with the relevant word interpretations.

Deriving quantifier scope ambiguity
Quantifier scope ambiguity as exemplified by the phrase "Every student likes some teacher" has already been shown to be obtainable using the two-step translation process in a Lambek-Grishin grammar [3], and in a Lambek grammar [16]. Here, we adapt the latter example so that it translates into the vector space model employed by [8,19], to show that both readings (narrow/wide and wide/narrow) can be obtained and give exactly the kind of meaning we would expect from a vector space model. This means that we can obtain the intended meaning in a derivational way. What is more, given that we have both a grammar available and concrete vectors that have been learnt, the process can potentially be fully automated.

Each word has to be associated with a syntactic type, and we have to give a word interpretation mapping the words to a vector or linear map. We assume a set of basic types {np, n, s} where s is the distinguished goal type. Polarity assignment is handled by stipulating that np and n are positive, and s is negative. Basic types np and n are interpreted as U and P(U), respectively, and s gets translated to R. The syntactic types and the word interpretation are given by the following table:

As a reminder, we also note the vectorial interpretation of lexical constants in the word interpretation:

The two proofs that we get from the focussed sequent calculus are displayed in Figures 2 and 3 (without labelling). If we take the proof term for the first proof and translate this into a vectorial map we get

For the second proof term, we get a slightly different map:

The unfolded maps are quite intimidating, so the complete computation is given in the appendix. Here we just note that the two maps reduce to the readings shown below:

We can see that these interpretations will give different results depending on the instantiation of the vectors. In fact, these interpretations correspond to the result of [19].
This effectively shows that quantifier scope ambiguity can be achieved in vector space models by the use of appropriate proof-theoretic notions.
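That the two readings can come apart is easy to see in a small set-theoretic model. The sketch below is a truth-conditional analogue only, not the vector space computation of the paper: it evaluates the direct scope reading (every > some) and the inverse scope reading (some > every) of "Every student likes some teacher" over a hypothetical relation. All names and data are made up for illustration.

```python
from typing import Set, Tuple

Entity = str
Relation = Set[Tuple[Entity, Entity]]


def direct_scope(students: Set[Entity], teachers: Set[Entity], likes: Relation) -> bool:
    """Every student x is such that some teacher y satisfies likes(x, y)."""
    return all(any((x, y) in likes for y in teachers) for x in students)


def inverse_scope(students: Set[Entity], teachers: Set[Entity], likes: Relation) -> bool:
    """Some teacher y is such that every student x satisfies likes(x, y)."""
    return any(all((x, y) in likes for x in students) for y in teachers)


students = {"s1", "s2"}
teachers = {"t1", "t2"}

# Each student likes a different teacher.
likes_different: Relation = {("s1", "t1"), ("s2", "t2")}
# Both students like the same teacher.
likes_same: Relation = {("s1", "t1"), ("s2", "t1")}

print(direct_scope(students, teachers, likes_different),
      inverse_scope(students, teachers, likes_different))
print(direct_scope(students, teachers, likes_same),
      inverse_scope(students, teachers, likes_same))
```

In the first model only the direct scope reading holds, while in the second both hold, which reflects the fact that the inverse scope reading entails the direct one but not the other way around.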

Concluding Remarks
In this paper, we elaborated on quantifier scope ambiguity in compositional distributional models of meaning. In particular, the approach of [16] using a continuation-passing-style translation for a polarised and focussed proof system for the Lambek calculus was combined with the approach to generalised quantifiers of [8]. The result is fully derivational and provides a fully worked-out compositional way to obtain ambiguous meanings for phrases like "Every student likes some teacher", thereby resolving the issue of manually assigning appropriate meaning vectors to such phrases.
Although we illustrate the approach with sentences containing two generalised quantifiers, it also works for a single quantifier. Since the applied strategy exploits the combinatorial choices of the proof system (focus on the first quantifier and then on the second one, or vice versa), we expect the approach to generalise to more quantifiers, though the possibility of overgeneration needs to be investigated.
As for experimental validation, since the writing of this paper, it has been recognised that using a powerset construction in vector spaces, to be able to make use of bialgebras, may not be very feasible in practical models: having a powerset as a basis may lead to an exponential blowup in vector space size, and could potentially give sparsity issues. One approach to deal with this could be to use fuzzy quantification [25], which has already been explored by [6].
Another interesting avenue is to work out how several phenomena involving the copying of linguistic material can be analysed in a compositional distributional model. Coordination and pronoun relativisation have been given an account using Frobenius algebras over vector spaces [9,20], where the Frobenius operations allow one to express element-wise multiplication on arbitrary tensors. In future work we hope to analyse ellipsis, a phenomenon for which it can be argued that copying has to be part of the syntactic process. Rules of controlled copying can then be interpreted using the Frobenius or bialgebra operations. A first step has already been taken by [10], and we wish to approach the problem from the typelogical perspective.