Graded hyponymy for compositional distributional semantics

The categorical compositional distributional model of natural language provides a conceptually motivated procedure to compute the meaning of a sentence, given its grammatical structure and the meanings of its words. This approach has outperformed other models in mainstream empirical language processing tasks, but lacks an effective model of lexical entailment. We address this shortcoming by exploiting the freedom in our abstract categorical framework to change our choice of semantic model. This allows us to describe hyponymy as a graded order on meanings, using models of partial information used in quantum computation. Quantum logic embeds in this graded order.

Introduction
Finding a formalization of language in which the meaning of a sentence can be computed from the meaning of its parts has been a longstanding goal in formal and computational linguistics.
Distributional semantics represents individual word meanings as vectors in finite dimensional real vector spaces. On the other hand, symbolic accounts of meaning combine words via compositional rules to form phrases and sentences. These two approaches are in some sense orthogonal. Distributional schemes have no obvious compositional structure, whereas compositional models lack a canonical way of determining the meaning of individual words. In Coecke et al. (2010), the authors develop the categorical compositional distributional model of natural language semantics. This model exploits the shared categorical structure of pregroup grammars and vector spaces to provide a compositional structure for distributional semantics. It has produced state-of-the-art results in measuring sentence similarity (Kartsaklis et al. 2012;Grefenstette and Sadrzadeh 2011), effectively describing aspects of the human understanding of sentences.
A satisfactory account of natural language should incorporate a suitable notion of lexical entailment. Until recently, categorical compositional distributional models of meaning have lacked this crucial feature. In order to address the entailment problem, we exploit the freedom inherent in our abstract categorical framework to change models. We move from a pure state setting to a category used to describe mixed states and partial knowledge in the semantics of categorical quantum mechanics. Meanings are now represented by density matrices rather than simple vectors. We use this extra flexibility to capture the concept of hyponymy, where one word may be seen as an instance of another. For example, red is a hyponym of colour. The hyponymy relation can be associated with a notion of logical entailment. Some entailment is crisp, for example: dog entails animal. However, we may also wish to permit entailments of differing strengths. For example, the concept dog gives high support to the concept pet, but does not completely entail it: some dogs are working dogs. The hyponymy relation we describe here can account for these phenomena. Some crisp entailment can be seen as encoding linguistic knowledge. The kind of entailment we are interested in here is, in general, about the properties that objects have in the world, rather than grammatically based entailment. In particular, we explicitly avoid downward-monotone contexts such as negation. We do, however, examine the hyponymy between an adjective-noun compound and the head noun. We should also be able to measure entailment strengths at the sentence level. For example, we require that Cujo is a dog crisply entails Cujo is an animal, but that the statement Cujo is a dog does not completely entail Cujo is a pet. Again, the relation we describe here will successfully describe this behaviour at the sentence level.
Closely related to the current work are the ideas in Balkır (2014), Balkır et al. (2016), and Sadrzadeh et al. (2018). In this work, the authors develop a graded form of entailment based on von Neumann entropy and with links to the distributional inclusion hypotheses developed by Geffet and Dagan (2005). The authors show how entailment at the word level carries through to entailment at the sentence level. However, this is done without taking account of the grading. In contrast, the measure that we develop here provides a lower bound for the entailment strength between sentences, based on the entailment strength between words. Some of the work presented here was developed in the first author's MSc thesis (Bankova 2015).
An obvious choice for a logic built upon vector spaces is quantum logic (Birkhoff and von Neumann 1936). Briefly, this logic represents propositions about quantum systems as projection operators on an appropriate Hilbert space. These projections form an orthomodular lattice where the distributive law fails in general. The logical structure is then inherited from the lattice structure in the usual way. In the current work, we propose an order that embeds the orthomodular lattice of projections, and so contains quantum logic. This order is based on the Löwner ordering with propositions represented by density matrices. When this ordering is applied to density matrices with the standard trace normalization, no propositions compare, and therefore the Löwner ordering is useless as applied to density operators. The trick we use is to develop an approximate entailment relationship which arises naturally from any commutative monoid. We introduce this in general terms and describe conditions under which this gives a graded measure of entailment. This grading becomes continuous with respect to noise. Our framework is flexible enough to subsume the Bayesian partial ordering of Coecke and Martin (2011) and provides it with a grading. A procedure is given for determining the hyponymy strength between any pair of phrases of the same overall grammatical type. The pair of phrases can have differing lengths. So, for example, we can compare 'blond men' to 'men', as these are both noun phrases. This is possible because within categorical compositional semantics, phrases of each type are reduced to one common space according to their type, and can be compared within that space. Furthermore, this notion is consistent with hyponymy at the word level, giving a lower bound on phrase hyponymy.
Entailment is an important and thriving area of research within distributional semantics. The PASCAL Recognising Textual Entailment Challenge (Dagan et al. 2006) has attracted a large number of researchers in the area and generated a number of approaches. Previous lines of research on entailment for distributional semantics investigate the development of directed similarity measures which can characterize entailment (Weeds et al. 2004;Kotlerman et al. 2010;Lenci and Benotto 2012). Geffet and Dagan (2005) introduce a pair of distributional inclusion hypotheses, where if a word v entails another word w, then all the typical features of the word v will also occur with the word w. Conversely, if all the typical features of v also occur with w, v is expected to entail w. Clarke (2009) defines a vector lattice for word vectors, and a notion of graded entailment with the properties of a conditional probability. Rimell (2014) explores the limitations of the distributional inclusion hypothesis by examining the properties of those features that are not shared between words. An interesting approach in Kiela et al. (2015) is to incorporate other modes of input into the representation of a word. Measures of entailment are based on the dispersion of a word representation, together with a similarity measure. All of these look at entailment at the word level.
Attempts have also been made to incorporate entailment measures with elements of compositionality. Baroni et al. (2012) exploit the entailment relations between adjective-noun and noun pairs to train a classifier that can detect similar relations. They further develop a theory of entailment for quantifiers. The approach that we propose here has the characteristic that it can be applied to more types of phrases and sentences than just adjective-noun and noun-noun type phrases.
Another approach to compositional vector-based entailment is the use of deep neural networks to represent logical semantics, as in Bowman et al. (2015), for example. The drawback with the use of this sort of method is that the transparency of the compositional method is lost: the networks may indeed learn how to represent logical semantics but it is not clear how they do so. In contrast, the method we propose has a clear basis in formal semantics and links to quantum logic.

Categorical compositional distributional meaning
Compositional and distributional accounts of meaning are unified in Coecke et al. (2010), which constructs the meaning of sentences from the meanings of their component parts using their syntactic structure.

Pregroup grammars
In order to describe syntactic structure, we use Lambek's pregroup grammars (Lambek 1997). Within the standard categorical compositional distributional model, it is possible to move to other forms of categorial grammar, as argued in Coecke et al. (2013). This is due to the fact that the category of finite-dimensional vector spaces is particularly well-behaved, and so grammars with greater or lesser structure may be used.
A pregroup $(P, \leq, \cdot, 1, (-)^l, (-)^r)$ is a partially ordered monoid $(P, \leq, \cdot, 1)$ where each element $p \in P$ has a left adjoint $p^l$ and a right adjoint $p^r$, such that the following inequalities hold:
(1) $p^l \cdot p \leq 1 \leq p \cdot p^l$ and $p \cdot p^r \leq 1 \leq p^r \cdot p$
Intuitively, we think of the elements of a pregroup as linguistic types. The monoidal structure allows us to form composite types, and the partial order encodes type reduction. The important right and left adjoints then enable the introduction of types requiring further elements on either their left or right respectively.
The pregroup grammar $\mathrm{Preg}_{\mathcal{B}}$ over an alphabet $\mathcal{B}$ is freely constructed from the atomic types in $\mathcal{B}$. In what follows we use an alphabet $\mathcal{B} = \{n, s\}$. We use the type $s$ to denote a declarative sentence and $n$ to denote a noun. A transitive verb can then be denoted $n^r s n^l$. If a string of words and their types reduces to the type $s$, the sentence is judged grammatical. The sentence John kicks cats is typed $n (n^r s n^l) n$, and can be reduced to $s$ as follows:
$n (n^r s n^l) n \leq 1 \cdot s n^l n \leq 1 \cdot s \cdot 1 \leq s$
This symbolic reduction can also be expressed graphically, as shown in Figure 1. In this diagrammatic notation, the elimination of types by means of the inequalities $n \cdot n^r \leq 1$ and $n^l \cdot n \leq 1$ is denoted by a 'cup'. The fact that the type $s$ is retained is represented by a straight wire.
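The type reduction for John kicks cats can be checked mechanically. The following is a minimal sketch, not part of the original model: it assumes a hypothetical string encoding of types (`n^r s n^l`), represents a simple type as a pair of an atom and an adjoint order, and applies only the contractions $p^l \cdot p \leq 1$ and $p \cdot p^r \leq 1$ with a greedy left-to-right strategy. Greedy contraction is enough for this example but is not a complete parser for arbitrary pregroup grammars.

```python
from typing import List, Tuple

# A simple type is (atom, z): z = 0 is the atom itself,
# z = -1 its left adjoint, z = +1 its right adjoint.
SimpleType = Tuple[str, int]


def parse(word_type: str) -> List[SimpleType]:
    """Parse a string such as 'n^r s n^l' (hypothetical encoding)."""
    adjoint = {"l": -1, "r": 1}
    result = []
    for token in word_type.split():
        atom, _, adj = token.partition("^")
        result.append((atom, adjoint[adj] if adj else 0))
    return result


def reduce_types(types: List[SimpleType]) -> List[SimpleType]:
    """Greedily contract adjacent pairs (a, z)(a, z + 1) to the unit.

    With z = -1 this is p^l . p <= 1; with z = 0 it is p . p^r <= 1.
    """
    stack: List[SimpleType] = []
    for atom, z in types:
        if stack and stack[-1] == (atom, z - 1):
            stack.pop()  # contraction: the pair is replaced by 1
        else:
            stack.append((atom, z))
    return stack


# John kicks cats: n (n^r s n^l) n
sentence = parse("n") + parse("n^r s n^l") + parse("n")
reduced = reduce_types(sentence)
print(reduced)  # the sentence is grammatical if only the type s remains
```

A string is judged grammatical in this sketch when the reduced list contains only the sentence type `("s", 0)`.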

Compositional distributional models
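The symbolic account and distributional approaches are linked by the fact that they are both compact closed categories. This compatibility allows the compositional rules of the grammar to be applied in the vector space model. In this way, we can map syntactically well-formed strings of words into one shared meaning space.
A compact closed category is a monoidal category in which for each object $A$ there are left and right dual objects $A^l$ and $A^r$, and corresponding unit and counit morphisms
$\eta^l : I \to A \otimes A^l$, $\eta^r : I \to A^r \otimes A$, $\varepsilon^l : A^l \otimes A \to I$, $\varepsilon^r : A \otimes A^r \to I$
satisfying the snake equations:
$(1_A \otimes \varepsilon^l) \circ (\eta^l \otimes 1_A) = 1_A$, $(\varepsilon^r \otimes 1_A) \circ (1_A \otimes \eta^r) = 1_A$,
$(\varepsilon^l \otimes 1_{A^l}) \circ (1_{A^l} \otimes \eta^l) = 1_{A^l}$, $(1_{A^r} \otimes \varepsilon^r) \circ (\eta^r \otimes 1_{A^r}) = 1_{A^r}$
The underlying poset of a pregroup can be viewed as a compact closed category with the monoidal structure given by the pregroup monoid, and $\varepsilon^l$, $\eta^l$, $\eta^r$, $\varepsilon^r$ the unique morphisms witnessing the inequalities of (1).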
Distributional vector space models live in the category FHilb of finite dimensional real Hilbert spaces and linear maps. FHilb is compact closed. Each object $V$ is its own dual and the left and right unit and counit morphisms coincide. Given a fixed basis $\{|v_i\rangle\}_i$ of $V$, we define the unit by
$\eta : \mathbb{R} \to V \otimes V :: 1 \mapsto \sum_i |v_i\rangle \otimes |v_i\rangle$
and counit by
$\varepsilon : V \otimes V \to \mathbb{R} :: \sum_{ij} c_{ij} |v_i\rangle \otimes |v_j\rangle \mapsto \sum_i c_{ii}$
Here, we use the physicists' bra-ket notation; for details see Nielsen and Chuang (2011).
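As an illustration of the unit and counit in FHilb, the sketch below builds both maps for a small space and checks one snake equation numerically. The dimension `d = 3` and the use of the standard basis are assumptions made only for the example.

```python
import numpy as np

d = 3                      # dimension of V (illustrative choice)
basis = np.eye(d)          # standard basis |v_i>
identity = np.eye(d)

# Unit: eta(1) = sum_i |v_i> (x) |v_i>, a vector in V (x) V.
eta = sum(np.kron(basis[i], basis[i]) for i in range(d))

# Counit: epsilon sends |v_i> (x) |v_j> to delta_ij, so as a row vector
# it has the same components as eta.
epsilon = eta.copy()

# Snake equation: (epsilon (x) 1_V) o (1_V (x) eta) = 1_V
one_tensor_eta = np.kron(identity, eta.reshape(-1, 1))      # V -> V (x) V (x) V
eps_tensor_one = np.kron(epsilon.reshape(1, -1), identity)  # V (x) V (x) V -> V
snake = eps_tensor_one @ one_tensor_eta

print(np.allclose(snake, identity))
```

Composing the counit with the unit gives the dimension of the space, `epsilon @ eta == d`, which is the usual value of a closed loop in the graphical calculus.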

Graphical calculus
The morphisms of compact closed categories can be expressed in a convenient graphical calculus (Kelly and Laplaza 1980) which we will exploit in the following sections. Objects are labelled wires, and morphisms are given as vertices with input and output wires. Composing morphisms consists of connecting input and output wires, and the tensor product is formed by juxtaposition, as shown in Figure 2. By convention the wire for the monoidal unit is omitted. The morphisms ε and η can then be represented by 'cups' and 'caps' as shown in Figure 3. The snake equations can be seen as straightening wires, as shown in Figure 4.

Grammatical Reductions in Vector Spaces
Following Preller and Sadrzadeh (2011), reductions of the pregroup grammar may be mapped onto the category FHilb of finite dimensional Hilbert spaces and linear maps using an appropriate strong monoidal functor $Q$:
$Q : \mathrm{Preg} \to \mathrm{FHilb}$
Strong monoidal functors automatically preserve the compact closed structure. For our example $\mathrm{Preg}_{\{n,s\}}$, we must map the noun and sentence types to appropriate finite dimensional vector spaces:
$Q(n) = N$, $Q(s) = S$
Composite types are then constructed functorially using the corresponding structure in FHilb. Each morphism $\alpha$ in the pregroup is mapped to a linear map interpreting sentences of that grammatical type. Then, given word vectors $|w_i\rangle$ with types $p_i$, and a type reduction $\alpha : p_1, p_2, \ldots, p_n \to s$, the meaning of the sentence $w_1 w_2 \ldots w_n$ is given by:
$|w_1 w_2 \ldots w_n\rangle = Q(\alpha)(|w_1\rangle \otimes |w_2\rangle \otimes \ldots \otimes |w_n\rangle)$
For example, as described in Section 2.1, transitive verbs have type $n^r s n^l$, and can, therefore, be represented in FHilb as a rank 3 tensor in the space $N \otimes S \otimes N$. The transitive sentence John kicks cats has type $n(n^r s n^l)n$, which reduces to the sentence type via $\varepsilon^r \otimes 1_s \otimes \varepsilon^l$. So representing $|kicks\rangle$ by
$|kicks\rangle = \sum_{ijk} c_{ijk} |n_i\rangle \otimes |s_j\rangle \otimes |n_k\rangle$
and using the definitions of the counits in FHilb we then have:
$|John\ kicks\ cats\rangle = (\varepsilon_N \otimes 1_S \otimes \varepsilon_N)(|John\rangle \otimes |kicks\rangle \otimes |cats\rangle) = \sum_j \sum_{ik} c_{ijk} \langle John|n_i\rangle \langle n_k|cats\rangle |s_j\rangle$
The category FHilb is actually a †-compact closed category. A †-compact closed category is a compact closed category with an additional dagger functor that is an identity-on-objects involution, satisfying natural coherence conditions. In the graphical calculus, the dagger operation "flips diagrams upside-down". In the case of FHilb the dagger sends a linear map to its adjoint, and this allows us to reason about inner products in a general categorical setting, so that meanings of sentences may be compared using the inner product to calculate the cosine distance between vector representations.
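In coordinates, the reduction of a transitive sentence is a tensor contraction of the subject and object vectors against the verb tensor. The sketch below uses random vectors as stand-ins for word meanings (an assumption, since no concrete vectors are given here) and checks that the contraction agrees with applying the linear map $\varepsilon_N \otimes 1_S \otimes \varepsilon_N$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_n, dim_s = 4, 2                       # illustrative dimensions of N and S

john = rng.random(dim_n)                  # subject vector in N
cats = rng.random(dim_n)                  # object vector in N
kicks = rng.random((dim_n, dim_s, dim_n))  # verb tensor in N (x) S (x) N

# Contract the subject with the first noun index of the verb and the
# object with the last one; the sentence index j survives.
sentence = np.einsum("i,ijk,k->j", john, kicks, cats)

# The same computation as an explicit linear map on the tensor product.
eps = np.eye(dim_n).reshape(1, -1)        # counit on N (x) N as a row vector
reduction = np.kron(np.kron(eps, np.eye(dim_s)), eps)
full_state = np.kron(np.kron(john, kicks.reshape(-1)), cats)
sentence_via_map = reduction @ full_state

print(np.allclose(sentence, sentence_via_map))
```

The `einsum` form avoids materializing the full tensor product, whose size grows as the product of all word space dimensions.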
The abstract categorical framework we have introduced allows meanings to be interpreted not just in FHilb, but in any †-compact closed category. We will exploit this freedom when we move to density matrices. Detailed presentations of the ideas in this section are given in Coecke et al. (2010) and Preller and Sadrzadeh (2011) and an introduction to relevant category theory in Coecke and Paquette (2011).

Positive operators and density matrices
The methods outlined in Section 2 can be applied to the richer setting of density matrices. Density matrices are used in quantum mechanics to express uncertainty about the state of a system. For unit vector $|v\rangle$, the projection operator $|v\rangle\langle v|$ onto the subspace spanned by $|v\rangle$ is called a pure state. Pure states can be thought of as giving sharp, unambiguous information. In general, density matrices are given by a convex sum of pure states, describing a probabilistic mixture. States that are not pure are referred to as mixed states. Necessary and sufficient conditions for an operator $\rho$ to encode such a mixture are:
• $\forall v \in V.\ \langle v|\rho|v\rangle \geq 0$,
• $\rho$ is self-adjoint,
• $\rho$ has trace 1.
Operators satisfying the first two axioms are called positive operators. The third axiom ensures that the operator represents a convex mixture of pure states. Relaxing this condition gives us different choices for normalization.
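The conditions on positive operators and density matrices can be tested numerically. The following sketch assumes real symmetric matrices, as in the rest of the paper, and uses a small tolerance `tol` (an implementation choice) to absorb floating point error.

```python
import numpy as np


def is_positive(op, tol=1e-9):
    """Self-adjoint with <v|op|v> >= 0 for all v, i.e. no negative eigenvalue."""
    op = np.asarray(op, dtype=float)
    if not np.allclose(op, op.T, atol=tol):
        return False
    return bool(np.linalg.eigvalsh(op).min() >= -tol)


def is_density_matrix(op, tol=1e-9):
    """Positive operator with trace 1."""
    op = np.asarray(op, dtype=float)
    return is_positive(op, tol) and bool(abs(np.trace(op) - 1.0) <= tol)


v = np.array([1.0, 0.0])
pure = np.outer(v, v)                     # pure state |v><v|
mixed = np.diag([0.5, 0.5])               # equal mixture of two pure states
print(is_density_matrix(pure), is_density_matrix(mixed))
```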

Representing words as positive matrices
Within standard distributional semantics, words are represented as vectors, where the values on specific dimensions correspond to some function of the frequency with which they co-occur with the words represented by the basis vectors. The vector space induced can be modified or reduced using singular value decomposition or other techniques, where the basis vectors no longer have specific meanings. In order to represent words as density matrices, we first observe that each word vector has a corresponding pure matrix:
⟦w⟧ = $|w\rangle\langle w|$
Words which are more general can be built up by taking sums over pure matrices. We can think of the meaning of the word pet as represented by:
⟦pet⟧ = $p_d |dog\rangle\langle dog| + p_c |cat\rangle\langle cat| + p_t |tarantula\rangle\langle tarantula| + \ldots$
In general, we consider the meaning of a word $w$ to be given by a collection of unit vectors $\{|w_i\rangle\}_i$, where each $|w_i\rangle$ represents an instance of the concept expressed by the word. Each $|w_i\rangle$ is weighted by $p_i \in [0, 1]$, such that $\sum_i p_i = 1$. These describe the meaning of $w$ as a weighted combination of exemplars. Then the density operator:
⟦w⟧ = $\sum_i p_i |w_i\rangle\langle w_i|$
represents the word $w$. This is an extension of the distributional hypothesis. The coefficients $p_i$ may be determined as a function of the frequency with which each word represented by a pure matrix co-occurs with the word represented by ⟦w⟧, for example.
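A word's density matrix can be assembled directly from its exemplars. In the sketch below the exemplar vectors for dog, cat and tarantula and their weights are made-up values chosen for illustration, and the helper name `density_from_exemplars` is hypothetical.

```python
import numpy as np


def density_from_exemplars(vectors, weights):
    """Return sum_i p_i |w_i><w_i| for exemplar vectors w_i and weights p_i."""
    vectors = np.asarray(vectors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    # Normalize each exemplar so that every |w_i><w_i| is a pure state.
    units = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return np.einsum("i,ij,ik->jk", weights, units, units)


# Illustrative exemplars: dog, cat, tarantula as orthonormal basis vectors.
exemplars = np.eye(3)
pet = density_from_exemplars(exemplars, [0.5, 0.4, 0.1])
print(pet)
```

Exemplars need not be orthogonal; the construction only requires unit vectors and weights summing to 1, so the resulting operator always has trace 1.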

The CPM construction
Applying Selinger's CPM construction (Selinger 2007) to FHilb produces a new †-compact closed category in which the states are positive operators. This construction has previously been exploited in a linguistic setting in Kartsaklis (2015), Piedeleu et al. (2015), and Balkır et al. (2016). Throughout this section $\mathcal{C}$ denotes an arbitrary †-compact closed category.

Definition 1 (Completely positive morphism). A $\mathcal{C}$-morphism $\varphi : A^* \otimes A \to B^* \otimes B$ is said to be completely positive (Selinger 2007) if there exists $C \in \mathrm{Ob}(\mathcal{C})$ and $k \in \mathcal{C}(C \otimes A, B)$, such that $\varphi$ can be written in the form:
$(k_* \otimes k) \circ (1_{A^*} \otimes \eta_C \otimes 1_A)$
Identity morphisms are completely positive, and completely positive morphisms are closed under composition in $\mathcal{C}$, leading to the following:
Definition 2. If $\mathcal{C}$ is a †-compact closed category then CPM($\mathcal{C}$) is a category with the same objects as $\mathcal{C}$ and its morphisms are the completely positive morphisms.
The †-compact structure required for interpreting language in our setting lifts to CPM($\mathcal{C}$):
Theorem 1. CPM($\mathcal{C}$) is also a †-compact closed category. There is a functor:
$E : \mathcal{C} \to \mathrm{CPM}(\mathcal{C}) :: k \mapsto k_* \otimes k$
This functor preserves the †-compact closed structure, and is faithful "up to a global phase" (Selinger 2007).

Diagrammatic calculus for CPM($\mathcal{C}$)
As CPM($\mathcal{C}$) is also a †-compact closed category, we can use the graphical calculus described in Section 2.3. By convention, the diagrammatic calculus for CPM($\mathcal{C}$) is drawn using thick wires. The corresponding diagrams in $\mathcal{C}$ are given in Table 1.
In the vector space model of meaning the transition between syntax and semantics was achieved by using a strong monoidal functor $Q : \mathrm{Preg} \to \mathrm{FHilb}$. Language can be assigned semantics in CPM(FHilb) in an entirely analogous way via a strong monoidal functor:
$S : \mathrm{Preg} \to \mathrm{CPM}(\mathrm{FHilb})$
Definition 3. Let $w_1, w_2, \ldots, w_n$ be a string of words with corresponding grammatical types $t_i$ in Preg. Suppose that the type reduction is given by $t_1, \ldots, t_n \xrightarrow{r} x$ for some $x \in \mathrm{Ob}(\mathrm{Preg})$. Let ⟦$w_i$⟧ be the meaning of word $w_i$ in CPM(FHilb), i.e. a state of the form $I \to S(t_i)$. Then the meaning of $w_1 w_2 \ldots w_n$ is given by:
⟦$w_1 w_2 \ldots w_n$⟧ = $S(r)$(⟦$w_1$⟧ ⊗ ... ⊗ ⟦$w_n$⟧)
We now have all the ingredients to derive sentence meanings in CPM(FHilb).

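Example 1. We first show that the results from FHilb lift to CPM(FHilb).
Let the noun space $N$ be a real Hilbert space with basis vectors given by $\{|n_i\rangle\}_i$, where for some $i$, $|n_i\rangle = |Clara\rangle$ and for some $j$, $|n_j\rangle = |beer\rangle$. Let the sentence space be another space $S$ with basis $\{|s_i\rangle\}_i$. The verb $|likes\rangle$ is given by:
$|likes\rangle = \sum_{pqr} C_{pqr} |n_p\rangle \otimes |s_q\rangle \otimes |n_r\rangle$
The density matrices for the nouns Clara and beer are in fact pure states given by:
⟦Clara⟧ = $|n_i\rangle\langle n_i|$ and ⟦beer⟧ = $|n_j\rangle\langle n_j|$
and similarly, ⟦likes⟧ in CPM(FHilb) is:
⟦likes⟧ = $\sum_{pqrtuv} C_{pqr} C_{tuv} |n_p\rangle\langle n_t| \otimes |s_q\rangle\langle s_u| \otimes |n_r\rangle\langle n_v|$
The meaning of the composite sentence is simply $(\varepsilon_N \otimes 1_S \otimes \varepsilon_N)$ applied to (⟦Clara⟧ ⊗ ⟦likes⟧ ⊗ ⟦beer⟧) as shown in Figure 5, with interpretation in FHilb shown in Figure 6. In terms of linear algebra, this corresponds to:
⟦Clara likes beer⟧ = $\sum_{pqrtuv} C_{pqr} C_{tuv} \langle n_i|n_p\rangle \langle n_t|n_i\rangle \, |s_q\rangle\langle s_u| \, \langle n_r|n_j\rangle \langle n_j|n_v\rangle = \sum_{qu} C_{iqj} C_{iuj} |s_q\rangle\langle s_u|$
This is a pure state corresponding to the vector $\sum_q C_{iqj} |s_q\rangle$. However, in CPM(FHilb) we can work with more than the pure states.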
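The lifting of the pure-state computation to CPM(FHilb) can be sketched numerically. The coefficients of the verb tensor below are random placeholders (the example above leaves them abstract), and the positions chosen for Clara and beer among the basis vectors are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
dim_n, dim_s = 3, 2                     # illustrative dimensions of N and S
C = rng.random((dim_n, dim_s, dim_n))   # verb coefficients (placeholder values)

i, j = 0, 2                             # basis positions of Clara and beer
clara = np.zeros((dim_n, dim_n))
clara[i, i] = 1.0                       # pure state |n_i><n_i|
beer = np.zeros((dim_n, dim_n))
beer[j, j] = 1.0                        # pure state |n_j><n_j|

# [[likes]] = |likes><likes|, kept as a six-index tensor (p, q, r, t, u, v).
likes = np.einsum("pqr,tuv->pqrtuv", C, C)

# Contract subject and object density matrices against the noun wires.
sentence = np.einsum("pt,pqrtuv,rv->qu", clara, likes, beer)

# The result should be the pure state of the vector sum_q C[i, q, j] |s_q>.
sentence_vector = C[i, :, j]
print(np.allclose(sentence, np.outer(sentence_vector, sentence_vector)))
```

Replacing `clara` or `beer` with a mixed density matrix in the same contraction yields a sentence meaning that is in general no longer pure.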

Example 2. Let the noun space N be a real Hilbert space with basis vectors given by {|n
Diagrammatically, this is shown in Figure 7.

The sisters enjoy drinks
The impurity is indicated by the fact that the pairs of states are connected by wires (Selinger 2007).

Predicates and entailment
If we consider a model of (non-deterministic) classical computation, a state of a set $X$ is just a subset $\rho \subseteq X$. Similarly, a predicate is a subset $A \subseteq X$. We say that $\rho$ satisfies $A$ if:
$\rho \subseteq A$
which we write as $\rho \Vdash A$. Predicate $A$ entails predicate $B$, written $A \models B$, if for every state $\rho$:
$\rho \Vdash A \Rightarrow \rho \Vdash B$
Clearly this is equivalent to requiring $A \subseteq B$.
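The equivalence between entailment over all states and subset inclusion can be confirmed by brute force on a small universe. The universe and predicates below are illustrative choices, and the helper names are hypothetical.

```python
from itertools import chain, combinations


def powerset(universe):
    """All subsets of the universe, i.e. all states."""
    items = list(universe)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


def satisfies(state, predicate):
    """A state satisfies a predicate when it is contained in it."""
    return state <= predicate


def entails(a, b, universe):
    """A entails B if every state satisfying A also satisfies B."""
    return all(
        satisfies(rho, b) for rho in powerset(universe) if satisfies(rho, a)
    )


universe = frozenset({"rex", "tom", "oak"})
dog = frozenset({"rex"})
animal = frozenset({"rex", "tom"})
print(entails(dog, animal, universe), entails(animal, dog, universe))
```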

The Löwner order
As our linguistic models derive from a quantum mechanical formalism, positive operators form a natural analogue for subsets as our predicates. This follows ideas in D' Hondt and Panangaden (2006) and earlier work in a probabilistic setting in Kozen (1983). Crucially, we can order positive operators (Löwner 1934).

Definition 4 (Löwner order). For positive operators $A$ and $B$, we define:
$A \sqsubseteq B \iff B - A$ is positive
If we consider this as an entailment relationship, we can follow our intuitions from the non-deterministic setting. Firstly, we introduce a suitable notion of satisfaction. For positive operator A and density matrix ρ, we define ρ ⊩ A as the positive real number tr(ρA).
This generalizes satisfaction from a binary relation to a binary function into the positive reals. We then find that the Löwner order can equivalently be phrased in terms of satisfaction as follows:
$A \sqsubseteq B \iff \forall \rho.\ (\rho \Vdash A) \leq (\rho \Vdash B)$
Linguistically, we can interpret this condition as saying that every noun, for example, satisfies predicate $B$ at least as strongly as it satisfies predicate $A$.
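A numerical sketch of the Löwner order and of satisfaction as tr(ρA) follows. The predicates are illustrative projectors on a three-dimensional space, the density matrices are sampled at random, and the tolerance is an implementation choice.

```python
import numpy as np


def loewner_leq(a, b, tol=1e-9):
    """A is below B in the Loewner order when B - A has no negative eigenvalue."""
    return bool(np.linalg.eigvalsh(b - a).min() >= -tol)


def satisfaction(rho, a):
    """Degree to which the state rho satisfies the predicate a: tr(rho a)."""
    return float(np.trace(rho @ a))


def random_density(dim, rng):
    """A random real density matrix (for testing only)."""
    m = rng.standard_normal((dim, dim))
    rho = m @ m.T
    return rho / np.trace(rho)


p_dog = np.diag([1.0, 0.0, 0.0])      # projector onto span{dog}
p_animal = np.diag([1.0, 1.0, 0.0])   # projector onto span{dog, cat}

rng = np.random.default_rng(2)
states = [random_density(3, rng) for _ in range(100)]
print(loewner_leq(p_dog, p_animal))
print(all(satisfaction(r, p_dog) <= satisfaction(r, p_animal) + 1e-12
          for r in states))
```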

Quantum logic
Quantum logic (Birkhoff and von Neumann 1936) views the projection operators on a Hilbert space as propositions about a quantum system. As the Löwner order restricts to the usual ordering on projection operators, we can embed quantum logic within the poset of positive operators, providing a direct link to existing theory.

A general setting for approximate entailment
We can build an entailment preorder on any commutative monoid, viewing the underlying set as a collection of propositions. We then write A |= B and say A entails B if there exists a proposition D such that A + D = B. If our commutative monoid is the powerset of some set X , with union the binary operation and unit the empty set, then we recover our non-deterministic computation example from the previous section. If, on the other hand, we take our commutative monoid to be the positive operators on some Hilbert space, with addition of operators and the zero operator as the monoid structure, we recover the Löwner ordering.
In linguistics, we may ask ourselves: does dog entail pet? Naïvely, the answer is clearly no, not every dog is a pet. This seems too crude for realistic applications though, most dogs are pets, and so we might say dog entails pet to some extent. This motivates our need for an approximate notion of entailment.
For proposition $E$, we say that $A$ entails $B$ to the extent $E$ if:
$A \models B + E$
We think of $E$ as an error term, for instance in our dogs and pets example, $E$ adds back in dogs that are not pets. Expanding definitions, we find $A$ entails $B$ to extent $E$ if there exists $D$ such that:
(2) $A + D = B + E$
From this more symmetrical formulation it is easy to see that for arbitrary propositions $A$, $B$, proposition $A$ trivially entails $B$ to extent $A$, as by commutativity:
$A + B = B + A$
It is therefore clear that the mere existence of a suitable error term is not sufficient for a weakened notion of entailment. If we restrict our attention to errors in a complete meet semilattice $\mathcal{E}_{A,B}$, we can take the lower bound on the $E$ satisfying equation (2) as our canonical choice. Finally, if we wish to be able to compare entailment strengths globally, this can be achieved by choosing a partial order $K$ of "error sizes" and monotone functions:
$\kappa_{A,B} : \mathcal{E}_{A,B} \to K$
sending errors to their corresponding size.
For example, if $A$ and $B$ are positive operators, we take our complete lattice of error terms $\mathcal{E}_{A,B}$ to be all operators of the form $(1 - k)A$ for $k \in [0, 1]$, ordered by the size of $1 - k$. We then take $k$ as the strength of the entailment, and refer to it as k-hyponymy.
In the case of finite sets $A$, $B$, we take $\mathcal{E}_{A,B} = \mathcal{P}(A)$, and take the size of the error terms as:
$\frac{\text{cardinality of } E}{\text{cardinality of } A}$
measuring "how much" of $A$ we have to supplement $B$ with. In terms of conditional probability, the error size for the minimal error is then:
$P(\neg B \mid A) = 1 - P(B \mid A)$
These general error terms are strictly more general than the k-hyponymy.
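For finite sets the canonical error term and its size can be computed directly. In the sketch below the monoid operation is set union, the smallest error within the subsets of A is taken to be A \ B, and the dog and pet sets are invented for illustration.

```python
def minimal_error(a, b):
    """Smallest E contained in A with A | D == B | E for some D (union as +)."""
    return a - b


def error_size(a, b):
    """Size of the minimal error relative to A: |A \\ B| / |A|."""
    if not a:
        raise ValueError("A must be non-empty")
    return len(minimal_error(a, b)) / len(a)


dogs = {"rex", "fido", "lassie", "k9"}   # k9 is a working dog, not a pet
pets = {"rex", "fido", "lassie", "tom"}

error = minimal_error(dogs, pets)
witness = pets - dogs                     # a D making equation (2) hold
print(error, error_size(dogs, pets))
```

One minus the error size is the proportion of dogs that are pets, which plays the role of the entailment strength in this set-based setting.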
Modelling hyponymy in the categorical compositional distributional semantics framework was first considered in Balkır (2014). She introduced an asymmetric similarity measure called representativeness on density matrices based on quantum relative entropy. This can be used to translate hyponym-hypernym relations to the level of positive transitive sentences. Our aim here will be to provide an alternative measure which relies only on the properties of density matrices and the fact that they are the states in CPM(FHilb). This will enable us to quantify the strength of the hyponymy relationship, described as k-hyponymy. The measure of hyponymy that we use has an advantage over the representativeness measure. Due to the way it combines with linear maps, we can give a quantitative measure to sentence-level entailment based on the entailment strengths between words, whereas representativeness is not shown to combine in this way.

Properties of hyponymy
Before proceeding with defining the concept of k-hyponymy, we give two properties of hyponymy that can be captured by our new measure.
• Asymmetry. If A is a hyponym of B, then usually, B is not a hyponym of A.
• Pseudo-transitivity. If X is a hyponym of Y and Y is a hyponym of Z, then X is a hyponym of Z. However, if the hyponymy is not perfect, then we get a weakened form of transitivity.
The measure of hyponymy that we described above and named k-hyponymy will be defined in terms of density matrices, the containers for word meanings. The idea is then to define a quantitative order on the density matrices, which is not a partial order, but does give us an indication of the asymmetric relationship between words.

Ordering positive matrices
A density matrix can be used to encode the precision that is needed when describing an action. In the sentence I took my pet to the vet, we do not know whether the pet is a dog, cat, tarantula, and so on. The sentence I took my dog to the vet is more specific. We then wish to develop an order on density matrices so that dog, as represented by |dog〉 〈dog| is more specific than pet as represented by ⟦pet⟧. This ordering may then be viewed as an entailment relation, and entailment between words can lift to the level of sentences, so that the sentence I took my dog to the vet entails the sentence I took my pet to the vet. Note that we do not require that the sentences have exactly the same structure. For example, we would like I took my brown dog to the vet to entail I took my dog to the vet, and we would expect this to happen because brown dog should entail dog.
We now define our notion of approximate entailment, following the discussions of Section 4.3:
Definition 5 (k-hyponym). We say that $A$ is a k-hyponym of $B$ for a given value of $k$ in the range (0, 1] and write $A \preccurlyeq_k B$ if:
$0 \sqsubseteq B - kA$
Note that such a $k$ need not be unique or even exist at all.
Definition 6 ($k_{max}$ hyponym). $k_{max}$ is the maximum value of $k \in (0, 1]$ for which we have $A \preccurlyeq_{k_{max}} B$.
In general, we are interested in the maximal value k max for which k-hyponymy holds between two positive operators. This k max value quantifies the strength of the entailment between the two operators.
In what follows, for operator A we write A + for the corresponding Moore-Penrose pseudo-inverse and supp(A) for the support of A.
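Lemma 2 (Balkır 2014). Let $A$, $B$ be positive operators. Then:
$\mathrm{supp}(A) \subseteq \mathrm{supp}(B) \iff \exists k.\ k > 0$ and $B - kA \geq 0$
We now develop an expression for the optimal $k$ in terms of the matrices $A$ and $B$.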

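Theorem 2. For positive self-adjoint matrices $A$, $B$ such that:
$\mathrm{supp}(A) \subseteq \mathrm{supp}(B)$
the maximum $k$ such that $B - kA \geq 0$ is given by $1/\lambda$ where $\lambda$ is the maximum eigenvalue of $B^+A$.
Proof. We wish to find the maximum $k$ for which
$\forall |x\rangle \in \mathbb{R}^n.\ \langle x|(B - kA)|x\rangle \geq 0$
Since $\mathrm{supp}(A) \subseteq \mathrm{supp}(B)$, such a $k$ exists. We assume that for $k = 1$, there is at least one $|x\rangle$ such that $\langle x|(B - kA)|x\rangle \leq 0$, since otherwise we're done. For all $|x\rangle \in \mathbb{R}^n$, $\langle x|(B - kA)|x\rangle$ increases continuously as $k$ decreases. We therefore decrease $k$ until $\langle x|(B - kA)|x\rangle \geq 0$ for all $|x\rangle$. Writing $k_0$ for this value, there will be at least one $|x_0\rangle$ at which $\langle x_0|(B - k_0A)|x_0\rangle = 0$. These points are minima, so that the vector of partial derivatives vanishes there:
$(B - k_0A)|x_0\rangle = 0$, that is, $B|x_0\rangle = k_0A|x_0\rangle$
Since $B^+B$ is a projector onto the support of $B$ and $\mathrm{supp}(A) \subseteq \mathrm{supp}(B)$, we have:
$B^+A|v_0\rangle = \frac{1}{k_0}|v_0\rangle$
where $|v_0\rangle = B^+B|x_0\rangle$, i.e., $1/k_0$ is an eigenvalue of $B^+A$.
Now, $B^+A$ has only non-negative eigenvalues, and in fact any pair of eigenvalue $1/k$ and eigenvector $|v\rangle$ will satisfy the condition $B|v\rangle = kA|v\rangle$. We now claim that to satisfy $\forall |x\rangle \in \mathbb{R}^n.\ \langle x|(B - kA)|x\rangle \geq 0$, we must choose $k_0$ equal to the reciprocal of the maximum eigenvalue $\lambda_0$ of $B^+A$. For a contradiction, take $\lambda_1 < \lambda_0$, so $1/\lambda_1 = k_1 > k_0 = 1/\lambda_0$. Then we require that $\forall |x\rangle \in \mathbb{R}^n.\ \langle x|(B - k_1A)|x\rangle \geq 0$, and in particular for $|v_0\rangle$. However:
$\langle v_0|(B - k_1A)|v_0\rangle = (k_0 - k_1)\langle v_0|A|v_0\rangle < 0$
since $B|v_0\rangle = k_0A|v_0\rangle$ and $k_1 > k_0$. We therefore choose $k_0$ equal to $1/\lambda_0$ where $\lambda_0$ is the maximum eigenvalue of $B^+A$, and $\langle x|(B - k_0A)|x\rangle \geq 0$ is satisfied for all $|x\rangle \in \mathbb{R}^n$.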
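Theorem 2 translates into a short computation. The sketch below is one possible implementation under stated assumptions: real symmetric inputs, NumPy's `pinv` for the Moore-Penrose pseudo-inverse, a return value of 0.0 as a convention when the support condition fails, and no capping of the result at 1 (which Definition 5 would require if the operators are not suitably normalized). The dog and pet matrices are illustrative.

```python
import numpy as np


def k_max(a, b, tol=1e-9):
    """Largest k with B - kA >= 0, computed as 1 / lambda_max(B^+ A)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    b_pinv = np.linalg.pinv(b)
    # B B^+ projects onto supp(B); supp(A) is inside supp(B) iff it fixes A.
    if not np.allclose(b @ b_pinv @ a, a, atol=tol):
        return 0.0  # convention: no positive k exists
    # B^+ A is not symmetric in general, but its eigenvalues are real.
    lam = float(np.linalg.eigvals(b_pinv @ a).real.max())
    if lam <= tol:
        raise ValueError("A must be non-zero")
    return 1.0 / lam


def is_positive(op, tol=1e-9):
    return bool(np.linalg.eigvalsh(op).min() >= -tol)


dog = np.diag([1.0, 0.0])   # pure state |dog><dog|
pet = np.diag([0.5, 0.5])   # equal mixture of dog and cat
k = k_max(dog, pet)
print(k, is_positive(pet - k * dog), is_positive(pet - (k + 0.05) * dog))
```

In this example the maximal k is attained exactly: `pet - k * dog` is positive, while any larger coefficient produces a negative eigenvalue.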
• Transitivity: k-hyponymy satisfies a version of transitivity. Suppose $A \preccurlyeq_k B$ and $B \preccurlyeq_l C$. Then $A \preccurlyeq_{kl} C$, since:
$klA \sqsubseteq lB \sqsubseteq C$
by transitivity of the Löwner order. For the maximal values $k_{max}$, $l_{max}$, $m_{max}$ such that $A \preccurlyeq_{k_{max}} B$, $B \preccurlyeq_{l_{max}} C$ and $A \preccurlyeq_{m_{max}} C$, we have the inequality $m_{max} \geq k_{max} l_{max}$.
• Continuity: For $A \preccurlyeq_k B$, when there is a small perturbation to $A$, there is a correspondingly small decrease in the value of $k$. The perturbation must lie in the support of $B$, but can introduce off-diagonal elements.

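Theorem 3. Given $A \preccurlyeq_k B$ and density operator $\rho$ such that $\mathrm{supp}(\rho) \subseteq \mathrm{supp}(B)$, then for any $\epsilon > 0$ we can choose a $\delta > 0$ such that:
$A' = A + \delta\rho \implies A' \preccurlyeq_{k'} B$ and $|k - k'| < \epsilon$
Proof of Theorem 3. We wish to show that we can choose $\delta$ such that $|k - k'| < \epsilon$. We use the notation $\lambda_{max}(A)$ for the maximum eigenvalue of $A$. $A' = A + \delta\rho$ satisfies the condition of Theorem 2, that $\mathrm{supp}(A') \subseteq \mathrm{supp}(B)$, since suppose $|x\rangle \notin \mathrm{supp}(B)$. $\mathrm{supp}(A) \subseteq \mathrm{supp}(B)$, so $|x\rangle \notin \mathrm{supp}(A)$ and $A|x\rangle = 0$. Similarly, $\rho|x\rangle = 0$. Therefore $(A + \delta\rho)|x\rangle = A'|x\rangle = 0$, so $|x\rangle \notin \mathrm{supp}(A')$. By Theorem 2 we have:
(3) $k - k' = \frac{1}{\lambda_{max}(B^+A)} - \frac{1}{\lambda_{max}(B^+A')} = \frac{\lambda_{max}(B^+A + \delta B^+\rho) - \lambda_{max}(B^+A)}{\lambda_{max}(B^+A')\,\lambda_{max}(B^+A)}$
We may treat the denominator of (3) as a constant. We expand the numerator and apply Weyl's inequalities (Weyl 1912). These inequalities apply only to Hermitian matrices, whereas we need to apply these to products of Hermitian matrices. Since $B^+$, $A$, and $\rho$ are all real-valued positive semidefinite, the products $B^+A$ and $B^+\rho$ have the same eigenvalues as the Hermitian matrices $A^{1/2}B^+A^{1/2}$ and $\rho^{1/2}B^+\rho^{1/2}$. Now:
$\lambda_{max}(B^+A + \delta B^+\rho) \leq \lambda_{max}(B^+A) + \delta\lambda_{max}(B^+\rho)$
Therefore:
$k - k' \leq \frac{\delta\lambda_{max}(B^+\rho)}{\lambda_{max}(B^+A')\,\lambda_{max}(B^+A)}$
so that given $\epsilon$, $A$, $B$, we can always choose a $\delta$ to make $k - k' \leq \epsilon$.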
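The continuity property can be observed numerically by perturbing A with a small multiple of a density operator supported inside supp(B). The matrices below are illustrative, and the bound checked, $k - k' \leq \delta\,\lambda_{max}(B^+\rho) / \lambda_{max}(B^+A)^2$, is a slightly weakened form of the Weyl-type estimate that uses $\lambda_{max}(B^+A') \geq \lambda_{max}(B^+A)$; it is stated here as an assumption of the sketch rather than a quoted result.

```python
import numpy as np


def lam_max(m):
    """Largest eigenvalue (real part) of a possibly non-symmetric matrix."""
    return float(np.linalg.eigvals(m).real.max())


def k_max(a, b):
    """1 / lambda_max(B^+ A), assuming supp(A) lies inside supp(B)."""
    return 1.0 / lam_max(np.linalg.pinv(b) @ a)


b = np.diag([0.5, 0.3, 0.2])            # full-rank B, so any rho is admissible
a = np.diag([1.0, 0.0, 0.0])
v = np.ones(3) / np.sqrt(3.0)
rho = np.outer(v, v)                     # perturbation with off-diagonal terms

b_pinv = np.linalg.pinv(b)
k = k_max(a, b)
results = []
for delta in (1e-1, 1e-2, 1e-3):
    k_prime = k_max(a + delta * rho, b)
    bound = delta * lam_max(b_pinv @ rho) / lam_max(b_pinv @ a) ** 2
    results.append((delta, k - k_prime, bound))
    print(delta, k - k_prime, bound)
```

The decrease `k - k_prime` shrinks with `delta` and stays below the bound in each case.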

Scaling
When comparing positive operators, in order to standardize the magnitudes resulting from calculations, it is natural to consider normalizing their trace so that we work with density operators. Unfortunately, this is a poor choice when working with the Löwner order as distinct pairs of density operators are never ordered with respect to each other, i.e., for density operators $\sigma$, $\tau$, $\sigma \sqsubseteq \tau \Rightarrow \sigma = \tau$. Another option is to bound operators as having maximum eigenvalue 1, as suggested in D'Hondt and Panangaden (2006). With this ordering, the projection operators regain their usual ordering and we recover quantum logic as a suborder of our setting.
Our framework is flexible enough to support other normalization strategies. The optimal choice for linguistic applications is left to future empirical work. Other ideas are also possible. For example we can embed the Bayesian order (Coecke and Martin 2011) within our setting via a suitable transformation on positive operators as follows:
1. Diagonalize the operator, choosing a permutation of the basis vectors such that the diagonal elements are in descending order.
2. Let $d_i$ denote the $i$th diagonal element. We define the diagonal of a new diagonal matrix inductively as follows:
3. Transform the new operator back to the original basis.
Further theoretical investigations of this type are left to future work.
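The difference between the two normalizations discussed in this section can be seen on a pair of nested projectors. This is a small sketch only; the helper names (`trace_normalise`, `max_eig_normalise`, `loewner_leq`) and the example projectors are our own illustrative choices.

```python
import numpy as np

def trace_normalise(A):
    """Scale a positive operator to unit trace (a density operator)."""
    return A / np.trace(A)

def max_eig_normalise(A):
    """Scale a positive operator so that its largest eigenvalue is 1."""
    return A / np.max(np.linalg.eigvalsh(A))

def loewner_leq(A, B, tol=1e-9):
    """True when B - A is positive semidefinite, i.e. A ⊑ B in the Löwner order."""
    return bool(np.min(np.linalg.eigvalsh(B - A)) >= -tol)

# Projectors onto nested subspaces (illustrative).
P_small = np.diag([1.0, 0.0, 0.0])
P_large = np.diag([1.0, 1.0, 0.0])

print(loewner_leq(max_eig_normalise(P_small), max_eig_normalise(P_large)))  # True
print(loewner_leq(trace_normalise(P_small), trace_normalise(P_large)))      # False
```

Under maximum eigenvalue normalization the projectors keep their usual ordering, whereas after trace normalization the two distinct density operators are no longer comparable.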

Representing the order in the 'Bloch disc'
The Bloch sphere (Bloch 1946) is a geometrical representation of quantum states. Very briefly, points on the sphere correspond to pure states, and points within the sphere to impure states. Since we consider matrices only over ℝ², we disregard the complex phase, which allows us to represent the pure states on a circle. A pure state cos(θ/2)|0〉 + sin(θ/2)|1〉 is represented by the vector (sin(θ), cos(θ)) on the circle.
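The correspondence between points of the disc and real 2×2 operators can be sketched as follows. The parametrisation ρ = ½(I + x σ_x + z σ_z) is the standard Bloch form restricted to real matrices; the function names and the example radius are illustrative assumptions.

```python
import numpy as np

def pure_state(theta):
    """Density matrix of the real pure state cos(theta/2)|0> + sin(theta/2)|1>."""
    psi = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    return np.outer(psi, psi)

def from_bloch_disc(x, z):
    """Real 2x2 density matrix with disc coordinates (x, z), x**2 + z**2 <= 1."""
    return 0.5 * np.array([[1 + z, x], [x, 1 - z]])

def to_bloch_disc(rho):
    """Disc coordinates (x, z) of a real 2x2 unit-trace matrix."""
    return 2 * rho[0, 1], rho[0, 0] - rho[1, 1]

theta = np.pi / 5
# A pure state lands on the circle at (sin(theta), cos(theta)).
on_circle = from_bloch_disc(np.sin(theta), np.cos(theta))
print(np.allclose(pure_state(theta), on_circle))

# A point at radius 3/4 in the same direction is an impure state
# with eigenvalues (1 - 3/4)/2 and (1 + 3/4)/2.
inside = from_bloch_disc(0.75 * np.sin(theta), 0.75 * np.cos(theta))
print(np.linalg.eigvalsh(inside))
```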
We can calculate the entailment factor k between any two points on the disc. Figure 8 shows contour maps of the entailment strengths for the state with Bloch vector v = ((3/4) sin(π/5), (3/4) cos(π/5)), using the maximum eigenvalue normalization.

6 Results on compositionality

This section provides results and examples on how the notion of hyponymy we have proposed interacts with the compositionality outlined in Section 2. We first give an example showing that phrases of different lengths can be compared. We then give a theorem and example to show that our notion of hyponymy 'lifts' to the sentence level, and that the k-values are preserved in a very intuitive fashion.

6.1 k-hyponymy in phrases of varying length
We can calculate the extent to which any pair of sentences or phrases are hyponyms of each other. We go back to the simple example in signifying that we are agnostic over all vectors with dimensions |blond〉, |brunette〉, |male〉. The adjective 'blond' is viewed as an operator which takes nouns to blond nouns. This is given by the following: Then if Carlos is described by the pure state |Carlos〉 = (1/√2)(|blond〉 + |male〉), we have ⟦Carlos⟧ = |Carlos〉〈Carlos| ≼_k ⟦blond men⟧ for k = 4/9 by Theorem 2. For Janette described by the pure state |Janette〉 = (1/√2)(|blond〉 + |female〉), we have ⟦Janette⟧ = |Janette〉〈Janette| ≼_k ⟦blond men⟧

An obvious line of enquiry here is to consider how to build this type of adjective operator computationally. One strategy might be to extend the linear regression approach from Baroni and Zamparelli (2010), having built representations of 'noun' and the noun phrase 'blond noun'. Techniques for building density matrix representations of nouns are described in Sadrzadeh et al. (2018).

Sentence k-hyponymy
We can show that the application of k-hyponymy to various phrase types holds in the same way. In this section we provide a general proof for varying phrase types. We adopt the following conventions:
• A positive phrase is assumed to be a phrase in which individual words are upwardly monotone in the sense described by Barwise and Cooper (1981) and MacCartney and Manning (2007). This means that, for example, the phrase does not contain any negations, including words like not.
• The length of a phrase is the number of words in it, not counting definite and indefinite articles.

so k_1 ⋯ k_n provides a lower bound on the extent to which φ(Φ) entails φ(Ψ).

, …, n}. This means that for each i, we have positive matrices ρ_i and nonnegative reals k_i such that ⟦B_i⟧ = k_i ⟦A_i⟧ + ρ_i. Now consider the meanings of the two sentences. We have:

Proof of Theorem 4. First of all, we have
where P consists of a sum of tensor products of positive matrices, namely: where: Then we have: since P is a sum of tensor products of positive matrices, and φ is a completely positive map. Therefore: as required.
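The displayed equations of this proof are not reproduced above. As a hedged reconstruction from the surrounding text only, assuming Φ = A_1 … A_n and Ψ = B_1 … B_n with ⟦B_i⟧ = k_i ⟦A_i⟧ + ρ_i, and assuming φ acts linearly on the tensor product of the word meanings, the argument has the following shape. The description of P as the remaining terms of the expansion is our reading, not the paper's stated formula.

```latex
\begin{align*}
\bigotimes_{i=1}^{n} [\![B_i]\!]
  &= \bigotimes_{i=1}^{n} \bigl(k_i [\![A_i]\!] + \rho_i\bigr)
   = (k_1 \cdots k_n) \bigotimes_{i=1}^{n} [\![A_i]\!] + P, \\
\varphi(\Psi)
  &= (k_1 \cdots k_n)\, \varphi(\Phi) + \varphi(P),
  \qquad \varphi(P) \ge 0, \\
\varphi(\Psi) - (k_1 \cdots k_n)\, \varphi(\Phi) &\ge 0 .
\end{align*}
```

Here P collects every term of the expansion that contains at least one factor ρ_i, so each term is a tensor product of positive matrices, and φ(P) is positive because φ is completely positive.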
Intuitively, this means that if (some of) the words of a sentence Φ are k-hyponyms of (some of) the words of sentence Ψ, then this hyponymy is translated into sentence hyponymy. Upward-monotonicity is important here, in particular as introduced by some implicit quantifiers. It might be objected that dogs bark should not imply pets bark. If the implicit quantification is universal, then this is true. However, the universal quantifier is downward monotone in the first argument, and therefore does not conform to the convention concerning positive phrases. If the implicit quantification is existential, then some dogs bark does entail some pets bark, and the problem is averted. Discussion of the behaviour of quantifiers and other word types is given in, for example, Barwise and Cooper (1981) or MacCartney and Manning (2007).
The quantity k 1 · · · k n is not necessarily maximal, and indeed usually is not. As we only have a lower bound, zero entailment strength between a pair of components does not imply zero entailment strength between entire sentences.
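A small numerical sketch can illustrate that the product of the word-level strengths is only a lower bound. It assumes that the maximal k is 1/λ_max(B^+ A) (our reading of Theorem 2) and uses pointwise (Hadamard) multiplication as a stand-in for the composition map φ. The matrices are random and unnormalised, so the values are illustrative only.

```python
import numpy as np

def entailment_strength(A, B):
    """Largest k with B - k*A positive semidefinite (assumes supp(A) within supp(B))."""
    eigs = np.linalg.eigvals(np.linalg.pinv(B) @ A)
    return 1.0 / np.max(eigs.real)

def random_psd(rng, dim):
    """Random positive semidefinite matrix M M^T (hypothetical word meaning)."""
    M = rng.normal(size=(dim, dim))
    return M @ M.T

rng = np.random.default_rng(0)
dim = 3
A1, A2 = random_psd(rng, dim), random_psd(rng, dim)
# Word-level hyponymy: B_i = k_i * A_i + rho_i with rho_i positive.
B1 = 0.5 * A1 + random_psd(rng, dim)
B2 = 0.4 * A2 + random_psd(rng, dim)

k1 = entailment_strength(A1, B1)
k2 = entailment_strength(A2, B2)
# Compose each phrase by pointwise multiplication (stand-in for phi).
k_phrase = entailment_strength(A1 * A2, B1 * B2)

print(f"k1*k2    = {k1 * k2:.4f}")
print(f"k_phrase = {k_phrase:.4f}")
```

The composed strength is never smaller than k1·k2, and it is typically strictly larger, which is consistent with the remark that the bound is usually not maximal.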

there is strict entailment in each component. Then there is strict entailment between the sentences φ(Φ) and φ(Ψ).
Proof of Corollary 1.
We consider a concrete example. We can see that ⟦s_2⟧ − (1/4)⟦s_1⟧ is positive by positivity of the individual elements and the fact that positivity is preserved under addition and tensor product. Therefore ⟦s_1⟧ ≼_{kl} ⟦s_2⟧ as required.

7 A toy experiment

To investigate the effectiveness of the model we perform a toy experiment using a simplified version of the model. We use the dataset introduced in Balkır et al. (2016). This dataset consists of pairs of simple sentences annotated by humans as to whether the first sentence entails the second. Example pairs are:

recommend development ⊨ suggest improvement
progress reduce ⊨ development replace

The first pair is rated highly by humans for entailment, whereas the second has lower ratings. The sentences are either noun-verb or verb-noun, and both sentences in a pair are of the same type.
We use simplified models of composition which we detail as follows. The first model is a baseline, where we use only the verb to predict the entailment between the two sentences. For the second and third models, we use the notion of a Frobenius algebra. As described in Kartsaklis et al. (2012), we can 'lift' lower-order vectors and tensors to higher-order ones. This means that we can obtain a representation for the verb by lifting a density matrix representation. This has the important aspect that the dimensionality needed to represent the word is greatly reduced. In the category CPM(FHilb), there are two Frobenius algebras we can use. The first equates to a pointwise multiplication of the noun and the verb, and the second is expressed by where ρ(s), ρ(n), and ρ(v) indicate density matrices for the sentence, noun, and verb respectively.
The last model we examine is an additive model. In general, addition of two positive operators will not be a morphism in CPM(FHilb). However, in the particular case where the operators are density matrices, we can design a morphism that will implement addition. We give this morphism diagrammatically in Figure 9.
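The composition models above take density matrices for nouns and verbs as input. A minimal sketch of the construction used for this experiment (sum over hyponym vectors, normalise, add small uniform noise on the diagonal, renormalise) is given here. The random vectors are placeholders for GloVe vectors, the function name is hypothetical, and normalising each vector to unit length before taking its outer product is our assumption about what "the density matrix corresponding to a vector" means.

```python
import numpy as np

def density_from_vectors(vectors, rng, noise=1e-3):
    """Build a density matrix from a collection of word vectors.

    Each vector is scaled to unit length (assumption), its outer product
    is added to a running sum, and the sum is normalised to unit trace.
    Uniform noise in [0, noise) is then added on the diagonal and the
    result is renormalised.
    """
    rho = sum(np.outer(v, v) / np.dot(v, v) for v in vectors)
    rho = rho / np.trace(rho)
    rho = rho + np.diag(rng.uniform(0.0, noise, size=rho.shape[0]))
    return rho / np.trace(rho)

rng = np.random.default_rng(0)
# Placeholder for the vectors of five hyponyms of one word, 50 dimensions each.
hyponym_vectors = rng.normal(size=(5, 50))
rho_word = density_from_vectors(hyponym_vectors, rng)

print(np.trace(rho_word))
print(np.min(np.linalg.eigvalsh(rho_word)) >= 0)
```

The diagonal noise makes the matrix full rank, so the support condition needed to compute an entailment strength between any two words is always met.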
Figure 9: Morphism implementing addition of density matrices.

To build density matrices for the nouns and verbs, we first collect a set of hyponyms for each word. To do this, we use WordNet (Miller 1995) via the Natural Language ToolKit (nltk) package in Python (Bird et al. 2009). We traverse the WordNet graph below each word to a depth of 8, and collect the lemma names of every hyponym encountered. We then use GloVe vectors (Pennington et al. 2014) to build representations of each word as follows. Note that the majority of the hyponyms encountered in WordNet were not present in the off-the-shelf GloVe dataset: approximately 47,000 hyponyms were found across all words in the sentence pairs, of which approximately 10,000 were in the GloVe dataset. To build the density matrix representation for each word, we summed the density matrices corresponding to the GloVe vectors of its hyponyms, and normalised. We then added small random values along the diagonal, uniformly distributed over [0, 10^−3), and renormalised. This step ensures that there is some minimal amount of entailment between every pair of words.

After creating sentence vectors from the composition of noun and verb vectors, we calculated the entailment using the result from Theorem 2. We ran the experiments over 50, 100, 200, and 300 dimension vectors. We judged the results by computing Spearman's ρ between the generated results and the mean of the human judgements. The best results were obtained with 50 dimensional vectors, which we report in Table 2. Table 2 gives an inter-annotator agreement of 0.66.

All the compositional models beat the verb-only baseline. The highest scoring model was the additive model, achieving close to inter-annotator agreement. Note that the sentences were extremely simple, so it remains to be seen how the commutative additive model fares when presented with more complex sentences. The best results from Balkır et al. (2016) were ρ = 0.66 for a vector-based model, also measured with Spearman's ρ, and our results are comparable. These vectors were built using part-of-speech information which our model did not use, so there is scope for improvement in that direction.

8 Conclusion

Integrating a logical framework with compositional distributional semantics is an important step in improving this model of language. By moving to the setting of density matrices, we have described a graded measure of hyponymy that may be used to describe the extent of hyponymy between two words represented within this enriched framework. This approach extends uniformly to provide hyponymy strengths between two phrases of the same type. That type can be any part of speech for which entailment makes sense, such as a noun phrase, verb phrase, or sentence. This includes pairs of phrases with differing numbers of words. We have also shown how a lower bound on hyponymy strength of phrases of the same structure can be calculated from their components.
Whilst we have given a means for modelling hyponymy in a compositional manner, and provided results on how hyponymy strengths compose, the task of integrating logical and distributional semantics is extremely wide-ranging. We mention here a number of areas to which we can start to contribute.
As mentioned in the introduction, some forms of crisp entailment are based in grammatical structure. So, for example, some adjectives interact with nouns to narrow down concepts, as in our example of 'blond men', and we therefore have that 'blond men' is a hyponym of 'men'. Other adjectives should not operate in this way, such as former in former president. This phenomenon is related to the notion of downward monotone contexts and the inclusion of negative words like not, or negative prefixes. At present, our model cannot effectively account for downward-monotone phenomena. In order to do so, additional structure, such as some form of involution, must be added to begin to model these phenomena.
The area of grammatical kinds of entailment also includes phenomena such as verb-phrase ellipsis. The framework developed here is all within the category of pregroups, and in order to be able to model more complex grammatical phenomena, we may need to move to other grammar categories. This has started to be developed in Kartsaklis et al. (2016) and we may therefore be able to use these methods within our current model.
The area of quantification is an important one. Hedges and Sadrzadeh (2016) have started to develop a theory of quantification within this framework, and so this is an area in which extension could be possible.
Another line of inquiry is to examine how transitivity behaves. In some cases entailment can strengthen. We had that dog entails pet to a certain extent, and that pet entails mammal to a certain extent, but that dog completely entails mammal.
Our framework supports different methods of scaling the positive operators representing propositions. Empirical work will be required to establish the most appropriate method in linguistic applications.
Acknowledgements

Bob Coecke, Martha Lewis, and Dan Marsden gratefully acknowledge funding from AFOSR grant Algorithmic and Logical Aspects when Composing Meanings. Martha Lewis gratefully acknowledges funding from NWO Veni grant Metaphorical Meanings for Artificial Agents.