Embeddings as Epistemic States: Limitations on the Use of Pooling Operators for Accumulating Knowledge

Various neural network architectures rely on pooling operators to aggregate information coming from different sources. It is often implicitly assumed in such contexts that vectors encode epistemic states, i.e. that vectors capture the evidence that has been obtained about some properties of interest, and that pooling these vectors yields a vector that combines this evidence. We study, for a number of standard pooling operators, under what conditions they are compatible with this idea, which we call the epistemic pooling principle. While we find that all the considered pooling operators can satisfy the epistemic pooling principle, this only holds when embeddings are sufficiently high-dimensional and, for most pooling operators, when the embeddings satisfy particular constraints (e.g. having non-negative coordinates). We furthermore show that these constraints have important implications on how the embeddings can be used in practice. In particular, we find that when the epistemic pooling principle is satisfied, in most cases it is impossible to verify the satisfaction of propositional formulas using linear scoring functions, with two exceptions: (i) max-pooling with embeddings that are upper-bounded and (ii) Hadamard pooling with non-negative embeddings. This finding helps to clarify, among others, why Graph Neural Networks sometimes under-perform in reasoning tasks. Finally, we also study an extension of the epistemic pooling principle to weighted epistemic states, which are important in the context of non-monotonic reasoning, where max-pooling emerges as the most suitable operator.


Introduction
One of the key challenges in many sub-areas of Machine Learning is to learn suitable vector space embeddings of the objects of interest (e.g. graphs, images or sentences). A question which is usually left implicit is what the embedding of an object represents. We can take at least two different views on this. First, we may consider that embeddings essentially serve as a compact representation of a similarity relation. What matters, then, is that objects which are similar, in some sense, are represented by similar vectors, while objects which are dissimilar are not. This intuition provides the foundation, for instance, for the use of contrastive pre-training strategies [1,2]. Second, we may consider that embeddings are essentially compact encodings of epistemic states. In other words, the embedding of an object encodes what we know about that object. What matters, under this view, is the set of properties that are captured by an embedding. This view implicitly underpins most strategies that combine neural network learning with aspects of symbolic reasoning, e.g. when using a semantic loss function to encourage neural network predictions to satisfy certain constraints [3] or when using neural network predictions as input to a probabilistic logic program [4]. In this paper, we focus on this second view.
In practice, the embedding of an object is often obtained by combining the embeddings of related objects using some kind of pooling operator. For instance, in the context of Computer Vision, convolutional feature extractors such as ResNet [5] provide an embedding for each sub-region of the image. An embedding of the overall image is then typically obtained by averaging these sub-region embeddings. Along similar lines, a standard setting in Natural Language Processing consists in using a transformer-based language model such as BERT [6] to obtain paragraph embeddings, and to average these embeddings to obtain an embedding for a full document. In multi-modal settings, it is common to obtain embeddings for the individual modalities first, and to subsequently aggregate these embeddings [7]. Graph Neural Networks [8,9] also crucially rely on pooling operators, learning node representations by aggregating embeddings derived from neighbouring nodes. Essentially, in all these cases we have an embedding e which is obtained by aggregating embeddings e_1, ..., e_k using some pooling operator ⊕: e = e_1 ⊕ ... ⊕ e_k. Under the epistemic view, this pooling operator is implicitly assumed to aggregate the knowledge that is captured by the embeddings e_1, ..., e_k. For instance, the embeddings e_1, ..., e_k may encode which objects are present in different parts of the image. After pooling these embeddings, we should end up with an embedding e that captures which objects are present throughout the entire image. Let us write Γ(e_i) for the knowledge that is captured by the embedding e_i. More precisely, we will think of Γ(e_i) as a set of properties that are known to be satisfied. If we view pooling as the process of accumulating knowledge from different sources (e.g. from different regions of the image, or different neighbours in a graph neural network), then we would expect the following to be true: Γ(e) = Γ(e_1) ∪ ... ∪ Γ(e_k).
We will refer to this principle as the epistemic pooling principle.
The main aim of this paper is to study under which conditions the epistemic pooling principle can be satisfied. Analysing a number of standard pooling operators, we find that the epistemic pooling principle can be satisfied for all of them, but with several important caveats:
• We need at least as many dimensions as there are properties. In settings where we want to model propositional formulas, the properties of interest correspond to possible worlds. This means in particular that we need embeddings with as many dimensions as there are possible worlds.
• For most of the pooling operators, we find that embeddings need to be constrained in a particular way, e.g. by only allowing vectors with non-negative coordinates.
• We also identify important restrictions on how embeddings can be linked to the formulas they capture.
The fundamental question which we want to answer is whether a vector-based representation can act as a formal knowledge representation framework, or whether reasoning with neural networks is inevitably approximate in nature. While we focus on a theoretical analysis of pooling operators in this paper, our results provide a number of important insights for the design of neural network models for tasks that require reasoning. For instance, two operators emerge from our analysis as being particularly suitable for applications where we need to reason about propositional formulas: max-pooling, with the constraint that the coordinates of all embeddings are upper-bounded by some constant z, and the Hadamard operator (i.e. the component-wise product), with the constraint that all coordinates are non-negative. Our results also provide lessons for the design of Graph Neural Networks (GNN), in settings where such networks are used for learning to reason. For instance, GNNs typically aggregate the messages coming from their neighbours by averaging them. In such cases, we find that GNNs can only capture logical reasoning if the node embeddings are essentially binary. Finally, our lower bounds on the required dimensionality of epistemic embeddings offer support for modular approaches (e.g. representing what we know about each entity of interest by a separate vector, rather than using a single vector to represent our overall epistemic state). For instance, an important open challenge in Natural Language Processing is to design models that can reason about text beyond the paragraph level (i.e. beyond the maximal input length of standard transformer models). Combining text fragments by pooling paragraph embeddings (e.g. representing different paragraphs or fields from the same document, or representing information obtained from different sources) would require a prohibitively high dimensionality, if we want these embeddings to capture epistemic states in a faithful way.
The remainder of this paper is structured as follows. In the next section, we formalise the problem setting and introduce the notations that will be used throughout the paper. In Section 3, we then analyse under which conditions the epistemic pooling principle can be satisfied. One of the main findings from this section is that the requirement to satisfy the epistemic pooling principle fundamentally constrains which embeddings can be allowed, and how these embeddings encode knowledge. In Section 4, we then investigate how this impacts our ability to use vector embeddings for propositional reasoning. In particular, we focus on the problem of verifying whether a given propositional formula is satisfied in the epistemic state encoded by a given vector. Subsequently, in Section 5, we look at a generalisation of the epistemic pooling principle for dealing with weighted epistemic states. In this case, vectors encode the strength with which we believe a given property to be satisfied. Section 6 presents a discussion of our results in the context of related work, after which we summarise our conclusions.

Problem setting
We assume that epistemic states are represented using sets of elementary properties. Let us write P for the set of all these properties. An epistemic state Q then simply corresponds to a subset of P. Intuitively, we think of these elementary properties as atomic pieces of evidence. The properties in an epistemic state Q then correspond to the evidence that is available, whereas the properties in P \ Q correspond to evidence that has not been encountered. For instance, in the context of image processing, we can think of the properties from P as elementary visual features, whose presence may be detected in an image. In applications where more intricate forms of reasoning are needed than simply aggregating sets of detected features, we can relate the properties in P to possible worlds. For each possible world ω, we can then consider a property p_ω, corresponding to the knowledge that ω can be excluded, i.e. that ω is not the true world. In this way, subsets of P can be used to represent arbitrary propositional knowledge bases. This link with logical reasoning will be developed in Section 4. For now, however, it will suffice to simply think of epistemic states as subsets of P.
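As a minimal illustration of this excluded-worlds encoding, the following Python sketch (the two-atom vocabulary and the helper name `excluded` are our own, purely illustrative choices) maps a knowledge base to the set of properties p_ω of the worlds it excludes:

```python
from itertools import product

# Hypothetical two-atom vocabulary; worlds are truth assignments over the atoms.
atoms = ["a", "b"]
worlds = [dict(zip(atoms, vals)) for vals in product([False, True], repeat=len(atoms))]

def excluded(kb):
    """Indices of the properties p_w: the worlds w that are not models of kb."""
    return {i for i, w in enumerate(worlds) if not all(f(w) for f in kb)}

# The knowledge base {a -> b}, written as a function on worlds:
kb = [lambda w: (not w["a"]) or w["b"]]
# Exactly one world is excluded: the one where a holds but b does not.
assert [worlds[i] for i in excluded(kb)] == [{"a": True, "b": False}]
```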
Taking the view that embeddings encode epistemic states, each e ∈ R^n will be associated with a set of properties from P. Formally, we assume that a scoring function γ_p : R^n → R is available for each property p ∈ P. We consider two variants of our setting, which differ in whether strict or weak inequalities are used to determine which properties are satisfied. As we will see, this choice has a material impact on the theoretical properties of the resulting framework.
Strict semantics. Under the strict semantics, we say that an embedding e ∈ R^n satisfies the property p ∈ P if γ_p(e) > 0. Let us write Γ(e) for the epistemic state encoded by e, i.e. the set of properties satisfied by e:

Γ(e) = {p ∈ P | γ_p(e) > 0}

Let ⊕ : R^n × R^n → R^n represent a pooling operator. The epistemic pooling principle can then be formalised as follows:

Γ(e ⊕ f) = Γ(e) ∪ Γ(f)  (3)

If we want to specify that the strict semantics is used, we will also refer to (3) as the strict epistemic pooling principle. Intuitively, the embeddings e and f capture information coming from two different sources, e.g. two different regions of an image or two different modalities. The principle captured by (3) is that the pooling operator should merely combine this information: the total evidence that is available is the union of the evidence provided by the two sources. Note that (3) is equivalent to:

∀p ∈ P . γ_p(e ⊕ f) > 0 ⇔ (γ_p(e) > 0 ∨ γ_p(f) > 0)

In the following, we will assume that all embeddings are taken from some set X ⊆ R^n. One possibility would be to choose X = R^n, but as we will see, it is sometimes necessary to make a more restrictive choice. For instance, we may have X = [0, +∞[^n if we want to restrict the discussion to vectors with non-negative coordinates. Regardless of how X is chosen, an important consideration is that the embeddings in X should allow us to capture every possible epistemic state, in the following sense:

∀Q ⊆ P . ∃e ∈ X . Γ(e) = Q  (5)

Finally, for the ease of presentation, we introduce the following notations:

Pos_p = {e ∈ X | γ_p(e) > 0}    Neg_p = {e ∈ X | γ_p(e) ≤ 0}

We will refer to Pos_p and Neg_p as the positive and negative regions for property p. Indeed, we have that e ∈ Pos_p iff p ∈ Γ(e), and e ∈ Neg_p otherwise.
Weak epistemic pooling principle. Under the weak semantics, we say that an embedding e ∈ R^n satisfies the property p ∈ P if γ_p(e) ≥ 0. Epistemic states are then determined as follows:

Γ(e) = {p ∈ P | γ_p(e) ≥ 0}

This definition gives rise to the following counterpart of (3), which we will refer to as the weak epistemic pooling principle:

Γ(e ⊕ f) = Γ(e) ∪ Γ(f)  (7)

We will furthermore require that the following counterpart to (5) is satisfied:

∀Q ⊆ P . ∃e ∈ X . Γ(e) = Q  (8)

Finally, the positive and negative regions are also defined analogously as before:

Pos_p = {e ∈ X | γ_p(e) ≥ 0}    Neg_p = {e ∈ X | γ_p(e) < 0}

Pooling operators. Whether the (strict or weak) epistemic pooling principle can be satisfied for every e, f ∈ X depends on the choice of the scoring functions γ_p, the set X ⊆ R^n and the pooling operator ⊕. In our analysis, we will focus on the following standard pooling operators:

Average: (e_1, ..., e_n) ⊕avg (f_1, ..., f_n) = (e + f)/2
Summation: (e_1, ..., e_n) ⊕sum (f_1, ..., f_n) = e + f
Max-pooling: (e_1, ..., e_n) ⊕max (f_1, ..., f_n) = (max(e_1, f_1), ..., max(e_n, f_n))
Hadamard: (e_1, ..., e_n) ⊕had (f_1, ..., f_n) = (e_1 · f_1, ..., e_n · f_n)

Note that the epistemic pooling principles in (3) and (7) are defined w.r.t. two arguments. This focus on binary pooling operators simplifies the formulation, while any negative results we obtain naturally carry over to pooling operators with more arguments. Moreover, most of the considered pooling operators are associative, with the exception of ⊕avg. Furthermore, even though ⊕avg itself is not associative, if it satisfies (3) or (7), its effect on the epistemic states encoded by the embeddings will nonetheless be associative, given that we have e.g. Γ(e_1 ⊕ (e_2 ⊕ e_3)) = Γ((e_1 ⊕ e_2) ⊕ e_3) = Γ(e_1) ∪ Γ(e_2) ∪ Γ(e_3), due to the associativity of the union. We now illustrate the key concepts with a simple example.
Example 1. Let P = {a, b} and suppose embeddings are taken from R^2. Let the scoring functions γ_a and γ_b be defined as follows: Now let e = (1/4, 0) and f = (3/4, 1). Then we have e ⊕avg f = (1/2, 1/2). We find: This means that the epistemic pooling principle (3) is satisfied for e and f. On the other hand, for g = (10, 10), we have Γ(g) = Γ(e ⊕avg g) = ∅, hence the epistemic pooling principle is not satisfied for e and g.
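The four pooling operators and the induced epistemic states can be sketched as follows. This is an illustrative Python sketch: the coordinate-wise scoring functions γ_{p_i}(e) = e_i are our own assumption, not the γ_a, γ_b of Example 1.

```python
# Sketch of the four pooling operators and the induced epistemic states,
# assuming the illustrative scoring functions gamma_{p_i}(e) = e_i.

def avg(e, f): return [(a + b) / 2 for a, b in zip(e, f)]
def sm(e, f):  return [a + b for a, b in zip(e, f)]
def mx(e, f):  return [max(a, b) for a, b in zip(e, f)]
def had(e, f): return [a * b for a, b in zip(e, f)]

def state(e, strict=True):
    """Gamma(e): the properties p_i whose score is > 0 (strict) or >= 0 (weak)."""
    return {i for i, x in enumerate(e) if (x > 0 if strict else x >= 0)}

e, f = [0.5, 0.0, 0.0], [0.0, 2.0, 0.0]          # non-negative coordinates
assert state(avg(e, f)) == state(e) | state(f)   # (3) holds here for avg
assert state(mx(e, f)) == state(e) | state(f)    # ... and for max-pooling
assert state(had(e, f)) != state(e) | state(f)   # had needs different scoring functions
```

The last assertion already hints at a theme of Section 3: which scoring functions are compatible with (3) depends on the pooling operator.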
Notations. Throughout this paper, we write δ(A) for the boundary of a set A ⊆ R^n. Similarly, we will write int(A) and cl(A) for the interior and closure of A, respectively.

Realizability of the epistemic pooling principle
In this section we study, for each of the considered pooling operators, whether they can satisfy the strict and weak epistemic pooling principles, and if so, under which conditions this is the case. In all cases, we find that the epistemic pooling principles can only be satisfied if n ≥ |P|, with n the dimensionality of the embeddings. For ⊕avg and ⊕sum, we also have to make assumptions on the set X, i.e. the epistemic pooling principles cannot be satisfied for X = R^n with these pooling operators. We furthermore find that the epistemic pooling principles can only be satisfied if γ_p satisfies particular conditions. Most significantly, we find that ⊕avg and ⊕sum cannot satisfy the weak epistemic pooling principle with continuous scoring functions γ_p, and that ⊕had cannot satisfy the strict epistemic pooling principle with continuous scoring functions. These results are summarised in Table 1.

Average
Strict semantics. The first question we look at is whether the strict epistemic pooling principle (3) can be satisfied for all e, f ∈ R^n. The following result shows that this is only possible in the trivial case where every embedding encodes the same epistemic state.
Proposition 1. Suppose (3) is satisfied for all embeddings e, f ∈ R^n, with ⊕ = ⊕avg. For any given p ∈ P we have

(∀e ∈ R^n . p ∈ Γ(e)) ∨ (∀e ∈ R^n . p ∉ Γ(e))

Proof. Suppose there exists some e ∈ R^n such that p ∈ Γ(e). We show that we then have p ∈ Γ(f) for every f ∈ R^n. Noting that f = e ⊕avg (2f − e), we know from (3) that Γ(f) = Γ(e) ∪ Γ(2f − e), and thus in particular that p ∈ Γ(f).
We will thus have to assume that embeddings are restricted to some subset X ⊂ R^n. To ensure that X is closed under the pooling operator ⊕avg, we will assume that X is convex. The following three lemmas explain how (3) constrains the scoring functions γ_p.
Lemma 1. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕avg. Suppose there exists some e ∈ X such that p ∈ Γ(e). It holds that Neg_p ⊆ δ(X).

Proof. Suppose f ∈ int(X). We show that f ∈ Pos_p. Let us define (λ ∈ R):

x_λ = λe + (1 − λ)f

Note that because we assumed that X is convex, it holds that x_λ ∈ X for all λ ∈ [0, 1]. We have p ∈ Γ(x_1), as x_1 = e. By repeatedly applying (3) we find that p belongs to Γ(x_{1/2}), Γ(x_{1/4}), Γ(x_{1/8}), etc. In the limit, we find that for every λ ∈ ]0, 1] it holds that p ∈ Γ(x_λ). Since f ∈ int(X), there exists some 0 < ε ≤ 1 such that 2f − x_ε ∈ X. Noting that f = x_ε ⊕avg (2f − x_ε), it follows from (3) that p ∈ Γ(f), i.e. f ∈ Pos_p.

Corollary 1. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕avg. Let p ∈ P. It holds that dim(Neg_p) ≤ n − 1.

Lemma 2. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕avg. Let p ∈ P. It holds that Pos_p is convex.
Proof. Suppose e, f ∈ Pos_p and define (λ ∈ [0, 1]):

x_λ = λe + (1 − λ)f

We show that x_λ ∈ Pos_p for every λ ∈ ]0, 1[. By applying (3) to f = x_0 and e = x_1 we find that x_{1/2} ∈ Pos_p. By applying (3) to x_0 and x_{1/2}, we find x_{1/4} ∈ Pos_p. Similarly, by applying (3) to x_{1/2} and x_1, we find x_{3/4} ∈ Pos_p. Continuing in this way, we find x_λ ∈ Pos_p for every λ of the form j/2^i with i ∈ N and j ∈ {0, 1, ..., 2^i}. Now let λ ∈ ]0, 1[. We can approximate λ arbitrarily well using a value of the form j/2^i. In particular, we can always find some i ∈ N and j ∈ {0, ..., 2^i} such that 0 < j/2^i < λ and λ < 2λ − j/2^i < 1. By (3), we have:

Γ(x_λ) = Γ(x_{j/2^i} ⊕avg x_{2λ−j/2^i}) = Γ(x_{j/2^i}) ∪ Γ(x_{2λ−j/2^i})

Since we already know that x_{j/2^i} ∈ Pos_p we thus also find x_λ ∈ Pos_p.
Lemma 3. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕avg. Let p ∈ P. It holds that Neg_p is convex.

Proof. Suppose e, f ∈ Neg_p and define x_λ = λe + (1 − λ)f for λ ∈ [0, 1]. By repeatedly applying (3), as in the proof of Lemma 2, we find that x_{j/2^i} ∈ Neg_p for every i ∈ N and j ∈ {0, 1, ..., 2^i}. Now suppose, for the sake of contradiction, that x_λ ∈ Pos_p for some λ ∈ ]0, 1[. Choose ε_1, ε_2 ∈ ]max(0, 2λ − 1), λ[ such that ε_1 ≠ ε_2 and ε_1 + ε_2 ≠ 2λ; then the values ε_1, ε_2, 2λ − ε_1, 2λ − ε_2 are pairwise distinct and belong to ]0, 1[. For each ε ∈ {ε_1, ε_2} we have x_λ = x_ε ⊕avg x_{2λ−ε}, hence by (3) it holds that x_ε ∈ Pos_p or x_{2λ−ε} ∈ Pos_p. This means that there are at least two distinct values λ_1, λ_2 ∈ {ε_1, ε_2, 2λ − ε_1, 2λ − ε_2} such that x_{λ_1} ∈ Pos_p and x_{λ_2} ∈ Pos_p. Let us assume w.l.o.g. that λ_1 < λ_2. We can always find some i ∈ N and j ∈ {0, ..., 2^i} such that λ_1 < j/2^i < λ_2. From the preceding discussion we already know that x_{j/2^i} ∈ Neg_p. However, from x_{λ_1} ∈ Pos_p and x_{λ_2} ∈ Pos_p, using Lemma 2 we find x_{j/2^i} ∈ Pos_p, a contradiction. It follows that x_λ ∈ Neg_p for every λ ∈ [0, 1].

From Lemmas 1, 2 and 3, we obtain the following corollary using the hyperplane separation theorem.
Corollary 2. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕avg. For any p ∈ P, there exists a hyperplane H_p such that Neg_p ⊆ δ(X) ∩ H_p.
The next proposition reveals that the dimensionality of the embeddings needs to be at least |P| if we want the epistemic pooling principle to be satisfied and at the same time ensure that every epistemic state is modelled by some vector.

Proposition 2. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕avg and X ⊆ R^n. Suppose that (5) is satisfied. It holds that n ≥ |P|.
Proof. Let p_1, ..., p_{|P|} be an enumeration of the properties in P. Note that because of (5), we have that Neg_{p_1} ≠ ∅ and Pos_{p_1} ≠ ∅. Moreover, from Lemmas 2 and 3 we know that these regions are both convex. It follows from the hyperplane separation theorem that there exists a hyperplane H_1 which separates Neg_{p_1} and Pos_{p_1}. From Lemma 1, we furthermore know that Neg_{p_1} ⊆ cl(Pos_{p_1}), which implies that Neg_{p_1} ⊆ H_1.
Note that H_1 ∩ Neg_{p_2} and H_1 ∩ Pos_{p_2} are convex regions. Moreover, since Neg_{p_1} ⊆ H_1, we find from (5) that H_1 ∩ Neg_{p_2} ≠ ∅ and H_1 ∩ Pos_{p_2} ≠ ∅. It follows from the hyperplane separation theorem that there exists some hyperplane H_2 separating H_1 ∩ Neg_{p_2} and H_1 ∩ Pos_{p_2}. Moreover, it holds that H_1 ∩ Neg_{p_2} ⊆ H_2. Indeed, suppose there was some e ∈ (H_1 ∩ Neg_{p_2}) \ H_2 and let f ∈ H_1 ∩ Pos_{p_2}. For i ∈ N \ {0} we define f_i = e ⊕avg f_{i−1}, with f_0 = f. Then there must be some i ∈ N \ {0} such that f_i is on the same side of hyperplane H_2 as e, which implies f_i ∈ Neg_{p_2}, since H_2 was chosen as a separating hyperplane. However, using (3) we also find that p_2 ∈ Γ(f_i) and thus f_i ∈ Pos_{p_2}, which is a contradiction. This means that H_1 ∩ Neg_{p_2} ⊆ H_2 and thus in particular also that Neg_{p_1} ∩ Neg_{p_2} ⊆ H_2. Continuing in this way, we find hyperplanes H_1, ..., H_{|P|} such that, for each i < |P|, the hyperplane H_{i+1} separates the non-empty convex regions (H_1 ∩ ... ∩ H_i) ∩ Neg_{p_{i+1}} and (H_1 ∩ ... ∩ H_i) ∩ Pos_{p_{i+1}}, with Neg_{p_1} ∩ ... ∩ Neg_{p_{i+1}} ⊆ H_1 ∩ ... ∩ H_{i+1}. Since both separated regions are non-empty, H_1 ∩ ... ∩ H_i is not contained in H_{i+1}, and thus the dimension of H_1 ∩ ... ∩ H_{i+1} is exactly one lower than that of H_1 ∩ ... ∩ H_i. Finally, because of (5) there exists some e ∈ X with Γ(e) = ∅; this e belongs to Neg_{p_1} ∩ ... ∩ Neg_{p_{|P|}} ⊆ H_1 ∩ ... ∩ H_{|P|}, so the latter intersection is non-empty and has dimension n − |P|. It follows that n ≥ |P|.
It is easy to see that the bound from this proposition cannot be strengthened, i.e. that it is possible to satisfy (3) and (5) while n = |P|. One possible construction is as follows. Let p_1, ..., p_n be an enumeration of the properties in P. We define:

X = [0, +∞[^n

and for i ∈ {1, ..., n} we define:

γ_{p_i}(e_1, ..., e_n) = e_i

To see why this choice satisfies (5), let Q ⊆ P and define e = (e_1, ..., e_n) as follows: e_i = 1 if p_i ∈ Q and e_i = 0 otherwise. Then it is straightforward to verify that Γ(e) = Q. Moreover, it is also clear that (3) is satisfied. Indeed, the i-th coordinate of (e_1, ..., e_n) ⊕avg (f_1, ..., f_n) is (e_i + f_i)/2, which is strictly positive iff e_i > 0 or f_i > 0, given that all coordinates are non-negative. A visualisation of this construction is shown in Figure 1. Note how the construction aligns with the common practice of learning sparse high-dimensional embeddings with non-negative coordinates.
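This construction can be checked numerically. The following sketch (with our own helper names) draws sparse non-negative vectors from X = [0, +∞[^n, verifies (3) for ⊕avg under the strict semantics, and shows a counterexample outside X:

```python
import random

# Randomised check of the construction X = [0, +inf[^n, gamma_{p_i}(e) = e_i.

def avg(e, f):
    return [(a + b) / 2 for a, b in zip(e, f)]

def state(e):
    # Strict semantics: p_i is satisfied iff the i-th coordinate is > 0.
    return {i for i, x in enumerate(e) if x > 0}

random.seed(0)
n = 4
for _ in range(1000):
    e = [random.choice([0.0, random.random()]) for _ in range(n)]
    f = [random.choice([0.0, random.random()]) for _ in range(n)]
    assert state(avg(e, f)) == state(e) | state(f)

# Outside X the principle fails: negative coordinates can cancel evidence.
e, f = [1.0, 0.0], [-1.0, 0.0]
assert state(avg(e, f)) != state(e) | state(f)
```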
Weak semantics. Let us now consider whether the weak epistemic pooling principle can also be satisfied for ⊕avg. Without any restrictions on the scoring functions γ_p, this is clearly the case. In particular, suppose that for each p ∈ P a function γ_p is defined such that (3) is satisfied for all embeddings e, f ∈ X, for some convex set X, and suppose furthermore that (5) is satisfied. Then we can define modified scoring functions as follows:

γ'_p(e) = γ_p(e) if γ_p(e) > 0 and γ'_p(e) = −1 otherwise  (9)

In particular, we have γ_p(e) > 0 iff γ'_p(e) ≥ 0. The fact that the strict epistemic pooling principle is satisfied for the scoring functions γ_p thus implies that the weak epistemic pooling principle is satisfied for the modified scoring functions γ'_p, while we can still model every epistemic state, i.e. (8) is also satisfied. The discontinuous nature of the scoring function γ'_p defined in (9) is clearly undesirable in practice. Hence the question arises whether it is possible to satisfy the weak epistemic pooling principle when only continuous scoring functions can be used. The answer to this question is negative. In particular, as the following result shows, if the weak epistemic pooling principle is satisfied with continuous scoring functions, all embeddings encode the same epistemic state.

Proposition 3. Suppose (7) is satisfied for all embeddings e, f ∈ X, for ⊕ = ⊕avg. Suppose γ_p is continuous. It holds that either Neg_p = ∅ or Pos_p = ∅.
Proof. Suppose there exists some e ∈ Neg_p and f ∈ Pos_p. We then have γ_p(e) < 0 and γ_p(f) ≥ 0. For λ ∈ [0, 1] we define:

x_λ = λf + (1 − λ)e

By repeatedly applying (7), we find that γ_p(x_{1/2^i}) ≥ 0 for every i ∈ N. In particular, this means that for every ε > 0 there exists some x ∈ X such that d(e, x) < ε and γ_p(x) ≥ 0. If γ_p is continuous this implies γ_p(e) ≥ 0, which is a contradiction since we assumed e ∈ Neg_p.

Summation
Strict semantics. We first show that (3) cannot be satisfied (in a non-trivial way) for all e, f ∈ R^n when using ⊕sum, as we also found for ⊕avg in Proposition 1.

Proposition 4. Suppose (3) is satisfied for all embeddings e, f ∈ R^n, with ⊕ = ⊕sum. For any given p ∈ P we have

(∀e ∈ R^n . p ∈ Γ(e)) ∨ (∀e ∈ R^n . p ∉ Γ(e))
We thus again need to define a suitable subset X ⊆ R^n. To ensure that X is closed under ⊕sum, it is not sufficient that X is convex. For this reason, we will assume that X is conically closed, in particular:

∀e ∈ X . ∀λ > 0 . λe ∈ X

We now show that whenever (3) is satisfied for ⊕sum, for all e, f ∈ X, it also holds that (3) is satisfied for ⊕avg, meaning that the results we have established for ⊕avg carry over to ⊕sum. We first show the following lemma.

Lemma 4. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕sum. Let p ∈ P and e ∈ X. If γ_p(e) > 0 then it holds that γ_p(λe) > 0 for every λ > 0.

Proposition 5. Suppose (3) is satisfied for all embeddings e, f ∈ X with ⊕ = ⊕sum. Then it also holds that (3) is satisfied for all embeddings e, f ∈ X with ⊕ = ⊕avg.

Proof. Suppose (3) is satisfied for all embeddings e, f ∈ X with ⊕ = ⊕sum. Let e, f ∈ X be such that p ∈ Γ(e ⊕avg f), in other words γ_p((e + f)/2) > 0. Using Lemma 4 we then find γ_p(e + f) > 0, and thus p ∈ Γ(e + f). Since (3) is satisfied for ⊕sum we find p ∈ Γ(e) ∪ Γ(f). Conversely, assume that p ∈ Γ(e) ∪ Γ(f). Since (3) is satisfied for ⊕sum, this implies p ∈ Γ(e + f), and using Lemma 4 we find p ∈ Γ((e + f)/2) = Γ(e ⊕avg f).

Among others, it follows from Proposition 5 that whenever (3) is satisfied for all embeddings e, f ∈ X with ⊕ = ⊕sum, we have that Pos_p and Neg_p are convex for every p ∈ P, and Neg_p ⊆ δ(X). In particular, we also have the following corollary.

Corollary 3. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕sum and X ⊆ R^n. Suppose that (5) is satisfied. It holds that n ≥ |P|.
Weak semantics. In entirely the same way as Proposition 5, we can show the following result.

Proposition 6. Suppose (7) is satisfied for all embeddings e, f ∈ X with ⊕ = ⊕sum. Then it also holds that (7) is satisfied for all embeddings e, f ∈ X with ⊕ = ⊕avg.

This means that we have the same negative result as we found for ⊕avg. In particular, from Propositions 3 and 6, we immediately obtain the following corollary.

Corollary 4. Suppose (7) is satisfied for all embeddings e, f ∈ X, for ⊕ = ⊕sum. Suppose γ_p is continuous. It holds that either Neg_p = ∅ or Pos_p = ∅.

Max-pooling

Strict semantics. In contrast to what we found for ⊕avg and ⊕sum, when using ⊕max it is possible to satisfy (3) for all e, f ∈ R^n in a non-trivial way. The main idea is illustrated in Figure 2. For reasons that will become clear in Section 4, in addition to the case where X = R^n, we also consider the case where X = ]−∞, z]^n for some z ∈ R. We now first show the following characterisation: whenever (3) is satisfied in a non-trivial way, we always have that cl(Neg_p) is of the form X ∩ (Y^1_p × ... × Y^n_p), where each Y^i_p is either of the form ]−∞, b] or equal to R.

Lemma 6. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕max. Let x = (x_1, ..., x_n) ∈ X and y = (y_1, ..., y_n) ∈ X be such that ∀i ∈ {1, ..., n} . x_i ≤ y_i. Then the following implication holds for every p ∈ P:

p ∈ Γ(x) ⇒ p ∈ Γ(y)  (11)

Proof. Given that x ⊕max y = y, it follows from (3) that Γ(x) ⊆ Γ(y), from which we immediately find (11).
Lemma 7. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕max. Let p ∈ P. If x, y ∈ cl(Neg_p) then x ⊕max y ∈ cl(Neg_p).

Proof. Let x, y ∈ cl(Neg_p). We show that for every ε > 0, there is some e ∈ Neg_p such that d(x ⊕max y, e) < ε. Since x, y ∈ cl(Neg_p), there exist e_x, e_y ∈ Neg_p such that d(e_x, x) < ε/√n and d(e_y, y) < ε/√n. Since e_x, e_y ∈ Neg_p, by (3) we also have e_x ⊕max e_y ∈ Neg_p. Moreover, we have, for e_x = (e_{x,1}, ..., e_{x,n}), e_y = (e_{y,1}, ..., e_{y,n}), x = (x_1, ..., x_n) and y = (y_1, ..., y_n):

|max(e_{x,i}, e_{y,i}) − max(x_i, y_i)| ≤ max(|e_{x,i} − x_i|, |e_{y,i} − y_i|) < ε/√n for every i ∈ {1, ..., n}

and thus d(e_x ⊕max e_y, x ⊕max y) < ε.
Proposition 7. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕max. Let p ∈ P. It holds that cl(Neg_p) is of the form X ∩ (Y^1_p × ... × Y^n_p), where each Y^i_p is either of the form ]−∞, b_i] or equal to R.

Proof. For each i ∈ {1, ..., n}, we can consider two cases:
• Assume that the i-th coordinate of the elements from cl(Neg_p) is bounded, i.e. there exists some b ∈ R such that x_i ≤ b for each (x_1, ..., x_n) ∈ cl(Neg_p). Suppose (x_1, ..., x_n) and (y_1, ..., y_n) are elements from cl(Neg_p) which are maximal in the i-th coordinate, i.e. for any ε > 0 we have (x_1, ..., x_{i−1}, x_i + ε, x_{i+1}, ..., x_n) ∉ cl(Neg_p) and (y_1, ..., y_{i−1}, y_i + ε, y_{i+1}, ..., y_n) ∉ cl(Neg_p). Assume furthermore that x_i < y_i. We have that (max(x_1, y_1), ..., max(x_n, y_n)) ∈ cl(Neg_p) by Lemma 7, which implies (x_1, ..., x_{i−1}, y_i, x_{i+1}, ..., x_n) ∈ cl(Neg_p) by Lemma 6. However, this is in contradiction with the assumption we made about the i-th coordinate of (x_1, ..., x_n). It follows that there is a constant b_i ∈ R such that x_i = b_i for every element of cl(Neg_p) which is maximal in the i-th coordinate, and thus that Y^i_p = ]−∞, b_i].
• Now we consider the case where the i-th coordinate of the elements from cl(Neg_p) is unbounded, in which case Y^i_p = R. Let (y_1, ..., y_n) ∈ cl(Neg_p). We show that for any z ∈ R it holds that (y_1, ..., y_{i−1}, z, y_{i+1}, ..., y_n) ∈ cl(Neg_p). Since the i-th coordinate is unbounded, there exists some (x_1, ..., x_n) ∈ cl(Neg_p) with x_i ≥ z. By Lemma 7 we have (max(x_1, y_1), ..., max(x_n, y_n)) ∈ cl(Neg_p), and since (y_1, ..., y_{i−1}, z, y_{i+1}, ..., y_n) is coordinate-wise smaller than this element, the claim follows using Lemma 6.

Putting these two cases together, we find that cl(Neg_p) is of the form X ∩ (Y^1_p × ... × Y^n_p).
Using the characterisation from Proposition 7, we now show that embeddings with a minimum of |P| dimensions are needed to satisfy the epistemic pooling principle.

Proposition 8. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕max. Suppose that (5) is satisfied. It holds that n ≥ |P|.
Proof. From Proposition 7 we know that for each property p ∈ P, it holds that cl(Neg_p) is of the form X ∩ (Y^1_p × ... × Y^n_p), where each Y^i_p is either of the form ]−∞, b] or equal to R. Let us write p <_i q for p, q ∈ P to denote that one of the following cases holds:
• the i-th coordinate is upper-bounded in Neg_p but not in Neg_q, or it is upper-bounded by some b_p in Neg_p and by some b_q in Neg_q with b_p < b_q;
• the i-th coordinate is upper-bounded by the same value b in Neg_p and in Neg_q, and there exists an element (x_1, ..., x_n) ∈ Neg_q such that x_i = b while no such element exists in Neg_p. In other words, the upper bound b for the i-th coordinate is strict for Neg_p but not for Neg_q.
For each i ∈ {1, ..., n}, we can choose a property p_i from P which is minimal w.r.t. the relation <_i. Suppose there was some property q ∈ P \ {p_1, ..., p_n}. Let x = (x_1, ..., x_n) ∈ Pos_q. Then for some coordinate i, it must be the case that x_i violates the upper bound that the i-th coordinate is subject to in Neg_q. Because p_i was chosen as a minimal element w.r.t. <_i, the bound on the i-th coordinate in Neg_{p_i} is at least as restrictive, and it follows that (x_1, ..., x_n) ∉ Neg_{p_i}. We thus find that for every x ∈ Pos_q it holds that x ∈ Pos_{p_1} ∪ ... ∪ Pos_{p_n}. It follows that there is no x ∈ X such that Γ(x) = {q}, meaning that (5) is not satisfied. To show that (3) and (5) can indeed be satisfied with |P| dimensions, let P = {p_1, ..., p_n} and define:

γ_{p_i}(e_1, ..., e_n) = e_i  (12)

Then we have that the i-th coordinate of (e_1, ..., e_n) ⊕max (f_1, ..., f_n) is max(e_i, f_i), which is strictly positive iff e_i > 0 or f_i > 0, hence (3) is satisfied for every e, f ∈ R^n. To see why (5) is satisfied, let Q ⊆ P. We define q = (q_1, ..., q_n) as follows: q_i = 1 if p_i ∈ Q and q_i = −1 otherwise; it is straightforward to verify that Γ(q) = Q. Weak semantics. As before, the main question is whether it is possible to satisfy (7) in a non-trivial way using continuous scoring functions γ_p, since the results from the strict semantics trivially carry over to the weak semantics if non-continuous scoring functions are allowed. This is indeed the case. In fact, with the scoring functions defined in (12), the weak epistemic pooling principle is also satisfied, since max(e_i, f_i) ≥ 0 iff e_i ≥ 0 or f_i ≥ 0. Moreover, in the same way as for the strict semantics, we find that (8) is satisfied for this choice. Finally, note that the lower bound n ≥ |P| still applies for the weak semantics, which can be shown in exactly the same way as Proposition 8.
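The max-pooling construction can likewise be verified numerically. The sketch below assumes the coordinate scoring functions γ_{p_i}(e) = e_i described above and checks (3) on arbitrary vectors from R^n:

```python
import random

# Randomised check that gamma_{p_i}(e) = e_i makes max-pooling satisfy (3)
# on all of R^n, under the strict semantics.

def mx(e, f):
    return [max(a, b) for a, b in zip(e, f)]

def state(e):
    return {i for i, x in enumerate(e) if x > 0}

random.seed(1)
for _ in range(1000):
    e = [random.uniform(-1, 1) for _ in range(5)]
    f = [random.uniform(-1, 1) for _ in range(5)]
    assert state(mx(e, f)) == state(e) | state(f)

# Every epistemic state Q is realised, e.g. by q_i = 1 if p_i in Q else -1:
Q = {0, 3}
q = [1.0 if i in Q else -1.0 for i in range(5)]
assert state(q) == Q
```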

Hadamard product
Strict semantics. Similarly to what we found for max-pooling, with the Hadamard product ⊕had it is possible to satisfy (3) for every e, f ∈ R^n, while also satisfying (5). In addition to the choice X = R^n, we also consider the case where X = [0, +∞[^n. As we will see, the results we establish in this section are valid regardless of whether X = R^n or X = [0, +∞[^n. The reason why we specifically include the case X = [0, +∞[^n will become clear in Section 4.
For i ∈ {1, ..., n}, we consider the hyperplanes H_i = {(x_1, ..., x_n) ∈ R^n | x_i = 0}. These hyperplanes play a particular role in the characterisation of the positive regions Pos_p, as was already illustrated in Figure 3. The following results make this role explicit.
Lemma 8. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕had. Let p ∈ P and e = (e_1, ..., e_n) ∈ Pos_p. Let I = {i ∈ {1, ..., n} | e_i = 0}. It holds that

X ∩ ⋂_{i∈I} H_i ⊆ Pos_p

Proof. Let f = (f_1, ..., f_n) ∈ X ∩ ⋂_{i∈I} H_i. We show that f ∈ Pos_p. Let x = (x_1, ..., x_n) be defined as follows: x_i = f_i/e_i if e_i ≠ 0 and x_i = 0 otherwise. For f = (f_1, ..., f_n) ∈ ⋂_{i∈I} H_i it holds that f_i = 0 whenever e_i = 0. We thus have f = e ⊕had x. Using (3), it then follows from e ∈ Pos_p that f ∈ Pos_p.
It follows that Pos_p is a finite union of regions of the form X ∩ ⋂_{i∈I} H_i. In particular, it also follows that dim(Pos_p) ≤ n − 1 if (3) is satisfied in a non-trivial way. For a given index set I ⊆ {1, ..., n}, let us define:

P_I = {p ∈ P | X ∩ ⋂_{i∈I} H_i ⊆ Pos_p}

Lemma 9. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕had. Let I, J ⊆ {1, ..., n}. It holds that:

P_{I∪J} = P_I ∪ P_J

Proof. Since I ⊆ I ∪ J, we have ⋂_{i∈I∪J} H_i ⊆ ⋂_{i∈I} H_i, and thus P_I ⊆ P_{I∪J}; similarly P_J ⊆ P_{I∪J}. Conversely, let p ∈ P_{I∪J}. Let f = (f_1, ..., f_n) be defined by f_i = 0 if i ∈ I and f_i = 1 otherwise, and let g = (g_1, ..., g_n) be defined by g_i = 0 if i ∈ J and g_i = 1 otherwise. Then f ⊕had g ∈ X ∩ ⋂_{i∈I∪J} H_i, and thus p ∈ Γ(f ⊕had g). From (3) it follows that f ∈ Pos_p or g ∈ Pos_p. Thus, using Lemma 8 we find that p ∈ P_I or p ∈ P_J.
We can now show that at least |P| dimensions are again needed to satisfy (3) in a non-trivial way.

Proposition 9. Suppose (3) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕had. Suppose that (5) is satisfied. It holds that n ≥ |P|.
Proof. Given (5), for each p ∈ P there must exist some e^p = (e^p_1, ..., e^p_n) ∈ X such that Γ(e^p) = {p}. Let us fix such a vector e^p for each p ∈ P and define I_p = {i ∈ {1, ..., n} | e^p_i = 0}. Note that by Lemma 8, we have X ∩ ⋂_{i∈I_p} H_i ⊆ Pos_p. Moreover, since Γ(e^p) = {p}, we have X ∩ ⋂_{i∈I_p} H_i ⊄ Pos_q for any q ≠ p. In other words, we have P_{I_p} = {p}.
For p ≠ q we clearly have I_p ⊄ I_q, since I_p ⊆ I_q would imply {p, q} ⊆ P_{I_q}. This implies in particular that I_p ≠ ∅ for every p ∈ P. Now let us consider k distinct properties p_1, p_2, ..., p_k. Then we cannot have I_{p_k} ⊆ I_{p_1} ∪ I_{p_2} ∪ ... ∪ I_{p_{k−1}}. Indeed, I_{p_k} ⊆ I_{p_1} ∪ ... ∪ I_{p_{k−1}} would imply p_k ∈ P_{I_{p_1} ∪ ... ∪ I_{p_{k−1}}}, whereas from Lemma 9 we know that P_{I_{p_1} ∪ ... ∪ I_{p_{k−1}}} = P_{I_{p_1}} ∪ ... ∪ P_{I_{p_{k−1}}}, and we know that the latter is equal to {p_1, ..., p_{k−1}}, a contradiction. In other words, there is at least one element in I_{p_k} which does not occur in I_{p_1} ∪ ... ∪ I_{p_{k−1}}. Since I_{p_1} ≠ ∅ and this needs to hold for every k ∈ {2, ..., |P|}, there need to be at least |P| distinct elements in I_{p_1} ∪ ... ∪ I_{p_{|P|}}. This means that n ≥ |P|.
To show that (3) and (5) can indeed be satisfied with |P| dimensions, let P = {p_1, ..., p_n} and define:

γ_{p_i}(e_1, ..., e_n) = 1 if e_i = 0, and γ_{p_i}(e_1, ..., e_n) = −1 otherwise    (13)

Clearly, the i-th coordinate of (e_1, ..., e_n) ⊕_had (f_1, ..., f_n) is 0 iff e_i = 0 or f_i = 0, hence we indeed have Γ(e ⊕_had f) = Γ(e) ∪ Γ(f), meaning that (3) is satisfied for every e, f ∈ R^n. It is also straightforward to verify that (5) is satisfied. Note, however, that the scoring function γ_{p_i} defined in (13) is not continuous. As the following result shows, this is not a coincidence.
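As a sanity check, the discontinuous construction above can be simulated numerically. The sketch below (function names are ours) identifies each property p_i with the coordinate i and verifies the epistemic pooling principle (3) for ⊕_had on randomly sampled vectors.

```python
import random

def gamma(e, i):
    # Scoring function from (13): p_i holds iff the i-th coordinate is 0
    return 1.0 if e[i] == 0 else -1.0

def captured(e):
    # Gamma(e) = {i : gamma_{p_i}(e) > 0}, identifying p_i with its index i
    return {i for i in range(len(e)) if gamma(e, i) > 0}

def hadamard(e, f):
    # Coordinate-wise (Hadamard) product
    return [a * b for a, b in zip(e, f)]

def check_pooling_principle(trials=1000, n=5):
    # Empirically verify (3): Gamma(e hadamard f) = Gamma(e) | Gamma(f)
    for _ in range(trials):
        e = [random.choice([0, 1, -2, 0.5]) for _ in range(n)]
        f = [random.choice([0, 1, -2, 0.5]) for _ in range(n)]
        if captured(hadamard(e, f)) != captured(e) | captured(f):
            return False
    return True
```

The check passes because a product of two coordinates is zero exactly when one of the factors is zero.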
Proposition 10. Let X = R^n or X = [0, +∞[^n. Suppose (3) is satisfied for all embeddings e, f ∈ X, for ⊕ = ⊕_had. Suppose γ_p is continuous. It holds that either Neg_p = ∅ or Pos_p = ∅.
Proof. Suppose Pos_p ≠ ∅. Then from Lemma 8, we know that there is some I ⊆ {1, ..., n} such that X ∩ ⋂_{i∈I} H_i ⊆ Pos_p. Let us assume that I is minimal, i.e. for any J ⊂ I we have X ∩ ⋂_{j∈J} H_j ⊈ Pos_p. Now suppose we also have Neg_p ≠ ∅. Then we have I ≠ ∅. Let i ∈ I and let e_0 ∈ X ∩ ⋂_{j∈I} H_j be chosen such that its coordinates outside I are non-zero. For ε > 0, let e_ε be the vector obtained from e_0 by replacing the i-th coordinate by ε. Since we assumed I was minimal, for any ε > 0 we have that e_ε ∈ Neg_p (otherwise Lemma 8 would yield X ∩ ⋂_{j∈I\{i}} H_j ⊆ Pos_p), or equivalently γ_p(e_ε) ≤ 0. However, if γ_p is continuous, this would imply γ_p(e_0) ≤ 0 and thus e_0 ∈ Neg_p, a contradiction.
The fact that only discontinuous scoring functions can be used is an important limitation in practice. One solution is to make a different choice for the set X.
It is straightforward to verify that, with this choice of scoring function, (7) is satisfied for all e, f ∈ R^n, while (8) is also satisfied. Moreover, the same argument as in the proof of Proposition 9 can be used for the weak semantics as well, meaning that we still need at least |P| dimensions to satisfy (7) with ⊕ = ⊕_had in a non-trivial way.

4. Propositional reasoning with epistemic embeddings
Throughout this paper, we model epistemic states as subsets of P. In general, we can simply think of the elements of P as atomic pieces of evidence. Crucially, however, this setting is expressive enough to capture propositional reasoning. In particular, suppose each embedding e is associated with a set Λ(e) of propositional formulas. Now suppose the embedding g is obtained by pooling the embeddings e and f, e.g. representing the information we have obtained from two different sources:

g = e ⊕ f

Then we want g to combine the knowledge captured by e and f. In other words, we want Λ(g) to be logically equivalent to Λ(e) ∪ Λ(f). This gives rise to the following variant of the epistemic pooling principle:

Λ(e ⊕ f) ≡ Λ(e) ∪ Λ(f)    (15)

where we write ≡ to denote logical equivalence between sets of propositional formulas. In other words, in this setting, we want to be able to do propositional reasoning by pooling embeddings. This can be achieved using our considered setting as follows. Let W be the set of all possible worlds, i.e. the set of all propositional interpretations over some set of atoms At. We associate one property with each possible world:

P = {p_ω | ω ∈ W}    (16)

Intuitively, p_ω means that ω can be excluded, i.e. that we have evidence that ω is not the true world. Let Clauses be the set of all clauses over the considered set of propositional atoms At. We can define Λ(e) in terms of the scoring functions γ_p as follows:

Λ(e) = {α ∈ Clauses | ∀ω ∈ W . γ_{p_ω}(e) ≤ 0 ⇒ ω |= α}    (17)

In other words, we have that α ∈ Λ(e) if α is true in all the worlds ω that cannot be excluded based on the evidence encoded by e. Note that Λ(e) is deductively closed, in the sense that α ∈ Λ(e) iff Λ(e) |= α. We can also straightforwardly show the following characterisation:

Lemma 10. Suppose P is defined by (16) and Λ is defined by (17). Let p_ω ∈ P. It holds that γ_{p_ω}(e) ≤ 0 iff ω |= Λ(e).
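The world-based encoding can be made concrete with a small sketch. The code below (all names are ours) enumerates the worlds over three atoms and checks entailment from Λ(e) by testing a formula in every world that is not excluded by e; formulas are represented as Python predicates over worlds.

```python
from itertools import product

ATOMS = ["a", "b", "c"]

# One world per truth assignment to the atoms; coordinate w of an
# embedding e plays the role of the score gamma_{p_omega}(e)
WORLDS = [dict(zip(ATOMS, bits)) for bits in product([False, True], repeat=len(ATOMS))]

def entails(e, formula):
    # Lambda(e) |= formula iff formula holds in every world that is
    # not excluded by e, i.e. every world whose score is <= 0
    return all(formula(w) for w, score in zip(WORLDS, e) if score <= 0)
```

For instance, an embedding that excludes exactly the worlds where atom a is false entails a, but not b.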
We can now prove the following result, which shows that (15) arises as a special case of the (strict) epistemic pooling principle (3).

Proposition 11. Suppose (3) is satisfied for all embeddings e, f ∈ X, for some X ⊆ R^n and some pooling operator ⊕ such that X is closed under ⊕. Suppose P is defined by (16) and Λ is defined by (17). It holds that (15) is satisfied for all e, f ∈ X.
Proof. Let e, f ∈ X. By (3) we have that Γ(e ⊕ f) ⊇ Γ(e), hence for every ω we have:

γ_{p_ω}(e) > 0 ⇒ γ_{p_ω}(e ⊕ f) > 0

From the definition of Λ, it follows that Λ(e ⊕ f) ⊇ Λ(e). Since we similarly have Λ(e ⊕ f) ⊇ Λ(f), we find Λ(e ⊕ f) ⊇ Λ(e) ∪ Λ(f) and in particular:

Λ(e ⊕ f) |= Λ(e) ∪ Λ(f)

Conversely, let α ∈ Λ(e ⊕ f). From (3) we know that γ_{p_ω}(e ⊕ f) ≤ 0 iff γ_{p_ω}(e) ≤ 0 and γ_{p_ω}(f) ≤ 0, hence we obtain:

∀ω ∈ W . (γ_{p_ω}(e) ≤ 0) ∧ (γ_{p_ω}(f) ≤ 0) ⇒ ω |= α

Using Lemma 10 we find: ∀ω ∈ W . (ω |= Λ(e)) ∧ (ω |= Λ(f)) ⇒ (ω |= α) or, equivalently, ∀ω ∈ W . (ω |= Λ(e) ∪ Λ(f)) ⇒ (ω |= α). We thus find Λ(e) ∪ Λ(f) |= α. Since this holds for every α ∈ Λ(e ⊕ f), we find Λ(e) ∪ Λ(f) |= Λ(e ⊕ f).

We can similarly model propositional reasoning using the weak semantics, by defining the set of formulas associated with an embedding e analogously to (17), with the condition γ_{p_ω}(e) < 0 in place of γ_{p_ω}(e) ≤ 0. The counterpart to Proposition 11 for the weak semantics is shown in entirely the same way. We can thus use the framework that was studied in Section 3 to combine, and reason about, propositional knowledge. However, when we focus on knowledge that is encoded using propositional formulas, we also need an effective way to check whether a given formula α is entailed by the knowledge base Λ(e), i.e. whether the knowledge encoded by e is sufficient to conclude that α holds. To this end, for each propositional formula α, we need a scoring function ψ_α : R^n → R such that ψ_α(e) indicates whether Λ(e) |= α. We now study such scoring functions.

4.1. Checking the satisfaction of propositional formulas
Let us consider scoring functions ψ_α : R^n → R, for arbitrary propositional formulas α, which satisfy the following condition:

ψ_α(e) > 0 ⇔ Λ(e) |= α

where we assume that Λ is defined as in (17). In other words, ψ_α(e) > 0 holds iff α is true in the epistemic state encoded by e. Let us write M(α) for the models of a formula α, where M(α) ⊆ W. We find from the definition of Λ that Λ(e) |= α is equivalent with:

∀ω ∈ W \ M(α) . γ_{p_ω}(e) > 0

In other words, to check the satisfaction of a propositional formula α, we need a scoring function that allows us to check whether γ_p(e) > 0 for every p in some subset of properties Q ⊆ P. In particular, let us define a scoring function γ_Q for every Q ⊆ P such that:

γ_Q(e) > 0 ⇔ ∀p ∈ Q . γ_p(e) > 0    (20)

Then we have:

ψ_α = γ_Q with Q = {p_ω | ω ∈ W \ M(α)}

The scoring functions of the form ψ_α are thus a special case of scoring functions of the form γ_Q. For generality, we will study the latter type of scoring functions in the remainder of this section. This also has the advantage that we can stay closer to the setting from Section 3. Analogously to Pos_p and Neg_p, we now define the following regions:

Pos_Q = {e ∈ X | γ_Q(e) > 0}    Neg_Q = {e ∈ X | γ_Q(e) ≤ 0}

Similarly, under the weak semantics, we can consider scoring functions of the form γ_Q, with Q ⊆ P, which are linked to the scoring functions γ_p as follows:

γ_Q(e) ≥ 0 ⇔ ∀p ∈ Q . γ_p(e) ≥ 0    (21)

The corresponding positive and negative regions are defined as follows:

Pos_Q = {e ∈ X | γ_Q(e) ≥ 0}    Neg_Q = {e ∈ X | γ_Q(e) < 0}

Clearly, if the scoring functions γ_p are continuous, then continuous scoring functions of the form γ_Q must also exist, as we can simply define γ_Q(e) = min_{q∈Q} γ_q(e), and similarly for the weak semantics. Our main focus in the remainder of this section is on the following question: is it possible for the scoring functions γ_Q to be linear, under the strict and weak semantics? In other words, can we use linear scoring functions to check the satisfaction of a propositional formula in the epistemic state encoded by a vector e? This question is important because of the prevalence of linear scoring functions in the classification layer of neural networks. Our results are summarised in Table 2.
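The min-based construction of γ_Q can be stated as a one-liner; the sketch below (names are ours) assumes the per-property scores have already been computed and stored in a dictionary.

```python
def gamma_Q(prop_scores, Q):
    # gamma_Q(e) = min over q in Q of gamma_q(e). Then gamma_Q(e) > 0 iff
    # gamma_q(e) > 0 for every q in Q (condition (20)), and likewise
    # gamma_Q(e) >= 0 iff gamma_q(e) >= 0 for every q in Q (condition (21))
    return min(prop_scores[q] for q in Q)
```

Note that the minimum of continuous functions is continuous, but the minimum of linear functions is in general only piecewise linear, which is why the linearity question studied below is non-trivial.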
One important finding is that linear scoring functions of the form γ_Q are not compatible with the strict semantics, regardless of how the embeddings were obtained (and thus also regardless of which pooling operator is considered). For the weak semantics, we find that linear scoring functions are possible with ⊕_max and ⊕_had, but crucially, this is only the case if the set of embeddings X is bounded in a suitable way.

Table 2: Is a linear scoring function γ_Q possible?

Pooling operator     Strict semantics    Weak semantics
Summation            no                  no
Average              no                  no
Max-pooling          no                  yes (for suitably bounded X)
Hadamard product     no                  yes (for X = [0, +∞[^n)

4.2. Linear scoring functions under the strict semantics
In the following, we assume that embeddings belong to some region X ⊆ R^n, which we assume to be convex. We make no assumptions about how the embeddings are obtained, requiring only that (5) is satisfied, i.e. for every epistemic state, there exists some embedding e ∈ X which encodes it. We show that scoring functions of the form γ_Q cannot then be linear. Before showing the main result, we first prove three lemmas.
Lemma 11. Suppose (5) is satisfied. Let Q = {p_1, ..., p_k} be a subset of P. Assume that γ_Q is linear and suppose e ∈ X is such that γ_Q(e) = 0. For every ε > 0 there exists some f ∈ X such that d(e, f) < ε while γ_Q(f) > 0.
Proof. Note that the linearity of γ_Q means that there exists some hyperplane H which separates the regions Pos_Q and Neg_Q, where γ_Q(e) = 0 means that e ∈ H. If e ∈ int(X) then the claim is trivially satisfied, as there exist vectors f on either side of the hyperplane which are arbitrarily close to e. Now assume e ∈ δ(X). Suppose there were an ε > 0 such that for every f ∈ X satisfying d(e, f) < ε, it holds that γ_Q(f) ≤ 0. It would then follow, given the convexity of X, that H is a bounding hyperplane of X, and in particular that X ⊆ Neg_Q. This is a contradiction, given that we assumed that (5) is satisfied.
Lemma 12. Suppose (5) is satisfied. Let Q = {p_1, ..., p_k} be a subset of P. Assume that γ_Q is linear and that γ_{p_1}, ..., γ_{p_k} are all continuous. For every e ∈ X we have: if γ_Q(e) = 0 then γ_{p_i}(e) ≥ 0 for every i ∈ {1, ..., k}.

Proof. Suppose γ_Q(e) = 0 and suppose γ_{p_i}(e) < 0 for some i ∈ {1, ..., k}. Because γ_{p_i} is continuous, there must be some ε > 0 such that γ_{p_i}(f) < 0 for every f satisfying d(e, f) < ε. However, by Lemma 11 we know that there must be such an f for which γ_Q(f) > 0, which implies γ_{p_i}(f) > 0, a contradiction.

Lemma 13. Suppose (5) is satisfied. Let Q = {p_1, ..., p_k} be a subset of P. Assume that γ_Q is linear and that γ_{p_1}, ..., γ_{p_k} are all continuous. If γ_Q(e) = 0 then γ_{p_i}(e) = 0 for at least one i ∈ {1, ..., k}.

Proof. By Lemma 12 we have γ_{p_i}(e) ≥ 0 for every i ∈ {1, ..., k}. If we had γ_{p_i}(e) > 0 for every i, then (20) would imply γ_Q(e) > 0, contradicting γ_Q(e) = 0.

Proposition 12. Suppose (5) is satisfied. Let Q = {p_1, ..., p_k} be a subset of P. Suppose γ_{p_1}, ..., γ_{p_k} and γ_Q are all linear. Then we have k = 1.

Proof. Figure 4 illustrates the argument provided in this proof. Suppose k > 1. Let H_Q be the hyperplane defined by H_Q = {e | γ_Q(e) = 0}. Let H_i similarly be the hyperplane corresponding to γ_{p_i}. Because of (5) there exists some e ∈ X such that Γ(e) = {p_1, ..., p_k}. Moreover, for each i ∈ {1, ..., k} there exists some f_i ∈ X such that Γ(f_i) = {p_1, ..., p_{i−1}, p_{i+1}, ..., p_k}. Then we have γ_Q(e) > 0 and γ_Q(f_i) ≤ 0 for every i ∈ {1, ..., k}. For each i, we let g_i be the point on the intersection between H_Q and the line segment defined by e and f_i. Note that g_i must exist, since γ_Q(e) > 0 and γ_Q(f_i) ≤ 0. Moreover, we have that g_i ∈ X since X is convex. Since γ_{p_1}, ..., γ_{p_k} are linear, the fact that γ_{p_j}(e) > 0 and γ_{p_j}(f_i) > 0, for j ≠ i, implies that γ_{p_j}(g_i) > 0. Since this holds for every j ≠ i while γ_Q(g_i) = 0, it follows from Lemma 13 that γ_{p_i}(g_i) = 0. Now consider g* = (1/k)(g_1 + ... + g_k). By the convexity of X, we have g* ∈ X. Moreover, since γ_{p_i}(g_j) > 0 if i ≠ j and γ_{p_i}(g_j) = 0 otherwise, if k > 1 we have that γ_{p_i}(g*) > 0 for every i ∈ {1, ..., k}. This would mean that Γ(g*) ⊇ {p_1, ..., p_k} and thus, by (20), that we should have γ_Q(g*) > 0. However, since g_1, ..., g_k ∈ H_Q, we also have g* ∈ H_Q and thus γ_Q(g*) = 0, a contradiction. It follows that k = 1.

4.3. Linear scoring functions under the weak semantics
In Section 4.2, we found that the strict semantics is not compatible with the use of linear scoring functions of the form γ_Q. We now explore whether linear scoring functions can be more successful under the weak semantics. We already know from Section 3 that ⊕_avg and ⊕_sum are not compatible with the use of continuous scoring functions under the weak semantics. We therefore focus on the remaining pooling operators, although we will return to ⊕_avg in Section 4.4.
Max-pooling
We now show that (21) cannot be satisfied with linear scoring functions if we choose X = R^n. In Proposition 7 we found that cl(Neg_p) is of the form Y^1_p × ... × Y^n_p. While this result was shown for the strict semantics, the same argument can be made for the weak semantics, i.e. if (7) is satisfied for every e, f ∈ R^n, then we have that cl(Neg_p) is of the form Y^1_p × ... × Y^n_p. Clearly, a region Neg_p of this form can only arise from a linear scoring function γ_p if Y^i_p = ]−∞, +∞[ for all but one i from {1, ..., n}. Note that there must exist some i ∈ {1, ..., n} such that Y^i_p is of the form ]−∞, b_i], to avoid the trivial case where Neg_p = R^n, which would imply that epistemic states in which p is known cannot be modelled. If γ_p is linear, for p ∈ P, we thus find that there must exist some i ∈ {1, ..., n} and some b_i ∈ R such that:

Neg_p = {(e_1, ..., e_n) ∈ R^n | e_i < b_i}    (22)

Note that we have a strict inequality in (22), since under the weak semantics we have Neg_p = {e ∈ X | γ_p(e) < 0}. For a set Q = {p_{i_1}, ..., p_{i_k}} with k ≥ 2, the region Pos_Q is then an intersection of k axis-aligned half-spaces, which cannot coincide with the half-space induced by a linear scoring function γ_Q on X = R^n. Linear scoring functions of the form γ_Q are thus only possible in the trivial case where |Q| = 1. Nevertheless, (21) can be satisfied with continuous scoring functions. In particular, for P = {p_1, ..., p_n}, we can define:

γ_{p_i}(e_1, ..., e_n) = −ReLU(−e_i)    γ_Q(e_1, ..., e_n) = −ReLU(−e_{i_1}) − ... − ReLU(−e_{i_k})

with ReLU(a) = max(0, a) the rectified linear unit. Note that γ_{p_i}(e_1, ..., e_n) ≥ 0 iff ReLU(−e_i) = 0 iff e_i ≥ 0. With this choice, we thus have that Neg_{p_i} is of the form (22). It is easy to verify that (7) is satisfied for every e, f ∈ R^n, and that (8) is satisfied as well.
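The ReLU-based construction for max-pooling can be checked empirically; the sketch below (names are ours) verifies that the weakly captured properties of e ⊕_max f are exactly the union of those of e and f.

```python
import random

def relu(a):
    return max(0.0, a)

def gamma(e, i):
    # Continuous, non-linear scoring function: gamma_{p_i}(e) = -ReLU(-e_i),
    # so that gamma_{p_i}(e) >= 0 iff e_i >= 0 (weak semantics)
    return -relu(-e[i])

def weak_captured(e):
    return {i for i in range(len(e)) if gamma(e, i) >= 0}

def max_pool(e, f):
    return [max(a, b) for a, b in zip(e, f)]

def check_weak_pooling(trials=1000, n=4):
    # Empirically verify (7): the pooled state captures the union
    for _ in range(trials):
        e = [random.uniform(-2, 2) for _ in range(n)]
        f = [random.uniform(-2, 2) for _ in range(n)]
        if weak_captured(max_pool(e, f)) != weak_captured(e) | weak_captured(f):
            return False
    return True
```

The check relies on the fact that max(a, b) ≥ 0 iff a ≥ 0 or b ≥ 0.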

Hadamard product
For the Hadamard product ⊕_had we find that linear scoring functions can be used under the weak semantics, provided that we choose X = [0, +∞[^n. To see this, let P = {p_1, ..., p_n} and let the scoring functions γ_{p_i} be defined as follows:

γ_{p_i}(e_1, ..., e_n) = −e_i

It is straightforward to verify that (7) is then satisfied for every e, f ∈ [0, +∞[^n, while (8) is also satisfied. For Q = {p_{i_1}, ..., p_{i_k}} we define:

γ_Q(e_1, ..., e_n) = −(e_{i_1} + ... + e_{i_k})

Then, for (e_1, ..., e_n) ∈ [0, +∞[^n, we have γ_Q(e_1, ..., e_n) ≥ 0 iff e_{i_1} = ... = e_{i_k} = 0 iff γ_{p_{i_j}}(e_1, ..., e_n) ≥ 0 for every j ∈ {1, ..., k}. Thus we find that (21) is indeed satisfied. In Section 3.4, we also considered the case where X = R^n. Unfortunately, for this choice of X, scoring functions of the form γ_Q can only be linear in the trivial case where |Q| = 1.

Proposition 14. Suppose that (7) is satisfied for every e, f ∈ R^n, with ⊕ = ⊕_had. Suppose (8) is satisfied. Let Q = {p_{i_1}, ..., p_{i_k}} be a subset of P. Suppose γ_{p_{i_1}}, ..., γ_{p_{i_k}} and γ_Q are all linear. Then we have |Q| = 1.
Proof. Let H_Q be the hyperplane associated with γ_Q, i.e. H_Q = {e | γ_Q(e) = 0}, and let H_1, ..., H_k similarly be the hyperplanes associated with γ_{p_{i_1}}, ..., γ_{p_{i_k}}. Clearly, for each e ∈ H_Q we have γ_{p_{i_1}}(e) ≥ 0, ..., γ_{p_{i_k}}(e) ≥ 0. This is only possible if the hyperplanes H_Q, H_1, ..., H_k are all parallel. Given that we assumed that (8) is satisfied, this is only possible if p_{i_1} = ... = p_{i_k}.

Similarly to what we found for max-pooling, it is possible to satisfy (21) for X = R^n by using a non-linear activation function. In this case, the aim of this activation function is to map all vectors e ∈ R^n to some vector in [0, +∞[^n, where we know that linear scoring functions are possible. In particular, we can use the absolute value and define:

γ_{p_i}(e_1, ..., e_n) = −|e_i|    γ_Q(e_1, ..., e_n) = −(|e_{i_1}| + ... + |e_{i_k}|)

It is easy to verify that (21) is indeed satisfied for this choice, while (7) is satisfied for every e, f ∈ R^n and (8) is also satisfied.
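The non-negative Hadamard construction, with the linear scoring function γ_Q from above, can likewise be verified numerically (names are ours):

```python
import random

def weak_captured(e):
    # With gamma_{p_i}(e) = -e_i on [0, +inf[^n, the property p_i is
    # weakly captured (gamma >= 0) exactly when e_i = 0
    return {i for i, x in enumerate(e) if -x >= 0}

def hadamard(e, f):
    return [a * b for a, b in zip(e, f)]

def gamma_Q(e, Q):
    # Linear scoring function: gamma_Q(e) = -(e_{i_1} + ... + e_{i_k})
    return -sum(e[i] for i in Q)

def check(trials=1000, n=4):
    for _ in range(trials):
        e = [random.choice([0.0, 0.5, 2.0]) for _ in range(n)]
        f = [random.choice([0.0, 0.5, 2.0]) for _ in range(n)]
        g = hadamard(e, f)
        # (7): the pooled epistemic state is the union of the two states
        if weak_captured(g) != weak_captured(e) | weak_captured(f):
            return False
        # (21): gamma_Q(g) >= 0 iff every property in Q is captured by g
        Q = [0, 1]
        if (gamma_Q(g, Q) >= 0) != all(i in weak_captured(g) for i in Q):
            return False
    return True
```

On non-negative coordinates, a sum is zero iff every term is zero, which is what makes the linear γ_Q work here.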

4.4. Reasoning with average pooling
In Section 4.2 we found that linear scoring functions cannot be used under the strict semantics, whereas in Section 3.1 we found that for ⊕_avg, continuous scoring functions cannot be used under the weak semantics. Hence, there does not appear to exist a straightforward mechanism to use average pooling in line with the idea of combining epistemic states. This theoretical finding seems to be at odds with the prevalence and empirical success of models that rely on average pooling. One possible argument is that the epistemic pooling principle is perhaps too restrictive. While it should not be the case that e ⊕_avg f captures properties that are not captured by e or f, we only need this principle to hold for the embeddings e, f that we are likely to encounter in practice. In particular, due to the way neural networks are trained, embeddings satisfying some property p are typically separated by some margin from embeddings which do not. Formally, let ∆ > 0 represent a given margin. Then we can expect that embeddings e will either satisfy γ_p(e) ≤ 0 or γ_p(e) ≥ ∆. If this is the case for every property p, we can think of e as representing a clear-cut epistemic state. In contrast, when 0 < γ_p(e) < ∆, we can think of e as being ambiguous regarding the satisfaction of p. Let us write X* ⊆ X for the clear-cut epistemic states, i.e.

X* = {e ∈ X | ∀p ∈ P . γ_p(e) ≤ 0 ∨ γ_p(e) ≥ ∆}

As in Section 3.1, let X = [0, +∞[^n and γ_{p_i}(e_1, ..., e_n) = e_i. We find that with this choice, (3) is satisfied for all e, f ∈ X while (5) is also satisfied. Note furthermore that X* = ({0} ∪ [∆, +∞[)^n. For Q = {p_{i_1}, ..., p_{i_k}} we define:

γ_Q(e_1, ..., e_n) = ∆ − ReLU(∆ − e_{i_1}) − ... − ReLU(∆ − e_{i_k})

Then (20) is satisfied for any (e_1, ..., e_n) ∈ X*. Indeed, for (e_1, ..., e_n) ∈ X* we find:

γ_Q(e_1, ..., e_n) > 0 ⇔ ReLU(∆ − e_{i_1}) + ... + ReLU(∆ − e_{i_k}) < ∆ ⇔ e_{i_j} ≥ ∆ for every j ∈ {1, ..., k}

where we used the fact that either ReLU(∆ − e_i) = 0 or ReLU(∆ − e_i) = ∆, given that (e_1, ..., e_n) ∈ X*. Note that the transformation e_i ↦ ReLU(∆ − e_i)/∆ converts (e_1, ..., e_n) into a vector that is binary, in the sense that each coordinate is either 0 or 1.
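The margin-based construction can be illustrated as follows; the concrete γ_Q below (∆ minus a sum of ReLU terms) is one choice consistent with the description above, and all names are ours.

```python
import random

DELTA = 1.0  # the margin separating "unknown" from "known" scores

def relu(a):
    return max(0.0, a)

def gamma_Q(e, Q, delta=DELTA):
    # On clear-cut states, each ReLU(delta - e_i) is either 0 (if
    # e_i >= delta) or delta (if e_i = 0), so gamma_Q(e) > 0 holds
    # exactly when every coordinate in Q clears the margin
    return delta - sum(relu(delta - e[i]) for i in Q)

def check_on_clear_cut(trials=1000, n=5):
    for _ in range(trials):
        # Sample a clear-cut state from X* = ({0} u [delta, +inf[)^n
        e = [random.choice([0.0, DELTA, DELTA + 1.5]) for _ in range(n)]
        Q = [0, 1, 2]
        if (gamma_Q(e, Q) > 0) != all(e[i] >= DELTA for i in Q):
            return False
    return True
```

Outside X*, i.e. for ambiguous embeddings, this equivalence can fail, which is precisely why the restriction to clear-cut states matters.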
Using the sigmoid function σ we can similarly ensure that (20) is satisfied for any (e_1, ..., e_n) ∈ X*, by constructing vectors that are approximately binary in the aforementioned sense. For instance, we could define:

γ_Q(e_1, ..., e_n) = σ(λ(e_{i_1} − ∆/2)) + ... + σ(λ(e_{i_k} − ∆/2)) − (k − 1 + µ)

where µ is an arbitrary constant satisfying 0 < µ < 1 and λ > 0 is chosen in function of µ. In particular, by choosing a sufficiently large value for λ, we can always ensure that σ(λ(e_i − ∆/2)) > 1 − (1 − µ)/k whenever e_i ≥ ∆, while σ(λ(e_i − ∆/2)) < µ/k whenever e_i = 0. We then have γ_Q(e_1, ..., e_n) > 0 iff σ(λ(e_{i_1} − ∆/2)) + ... + σ(λ(e_{i_k} − ∆/2)) > k − 1 + µ. Given the assumption that λ is sufficiently large, and given that (e_1, ..., e_n) ∈ X*, this inequality is satisfied iff e_{i_j} ≥ ∆ for every j ∈ {1, ..., k}. The rest of the argument then follows as before. We can also use linear scoring functions, by defining X* such that embeddings are approximately binary. In particular, let X = [0, 1]^n with ∆ = 1 − ε for some ε ∈ ]0, 1/k[. Let γ_{p_j}(e_1, ..., e_n) = e_j as before. For Q = {p_{i_1}, ..., p_{i_k}}, we define:

γ_Q(e_1, ..., e_n) = e_{i_1} + ... + e_{i_k} − (k − 1)

We then have γ_Q(e_1, ..., e_n) > 0 iff e_{i_1} + ... + e_{i_k} > k − 1. Given our assumption that ε < 1/k, this inequality is satisfied iff e_{i_j} ≥ ∆ = 1 − ε for every j ∈ {1, ..., k}. The rest of the argument then follows as before.
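The linear construction on approximately binary embeddings can be checked in the same way (names are ours):

```python
import random

def gamma_Q_linear(e, Q):
    # Linear scoring function: sum of the coordinates in Q, minus (k - 1)
    return sum(e[i] for i in Q) - (len(Q) - 1)

def check(trials=1000, n=6, k=3, eps=0.2):
    assert eps < 1.0 / k  # the assumption eps in ]0, 1/k[ from the text
    delta = 1.0 - eps
    for _ in range(trials):
        # Approximately binary clear-cut states: each coordinate is
        # either 0 or at least delta = 1 - eps
        e = [random.choice([0.0, random.uniform(delta, 1.0)]) for _ in range(n)]
        Q = list(range(k))
        if (gamma_Q_linear(e, Q) > 0) != all(e[i] >= delta for i in Q):
            return False
    return True
```

With k coordinates each at least 1 − ε, the sum exceeds k − kε > k − 1, while a single zero coordinate caps the sum at k − 1, which is the entire argument in miniature.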

5. Modelling weighted epistemic states
We now consider a setting in which each property p from P is associated with a certainty level from Λ = {0, 1, ..., K}. Intuitively, a certainty level of 0 means that we know nothing about p, whereas a level of K means that we are fully certain that p is true. When aggregating evidence from different sources, we assume that the certainty level of p is given by the certainty level of the most confident source, in accordance with possibility theory [10]. This setting has the advantage that we can continue to view pooling operators in terms of accumulating knowledge.
Reduction to the standard setting. It is possible to model weighted epistemic states within the standard framework we have considered so far, using scoring functions of the form γ_Q. We consider the strict semantics here, but an entirely similar argument can be made for the weak semantics. Let us define the set of extended properties as follows:

P′ = {p^i | p ∈ P, i ∈ Λ}

For each p^i ∈ P′, we let γ_{p^i} be a scoring function satisfying (3) for all embeddings e, f ∈ X. We can interpret p^i as encoding that the certainty level of p is not equal to i. For p ∈ P and i ∈ Λ \ {0} we define:

Q(p, i) = {p^0, p^1, ..., p^{i−1}}

Let γ_{Q(p,i)} be a scoring function satisfying (20). Then γ_{Q(p,i)}(e) > 0 means that all certainty levels below i can be excluded for p, i.e. that p is certain at least to degree i.
Weighted epistemic pooling principle. The construction from the previous paragraph has an important drawback: the number of properties is increased (K + 1)-fold. From Section 3, we know that this implies that the number of dimensions also has to increase (K + 1)-fold. If for every p and i, we need the ability to model that the certainty of p is not i, this increase in dimensionality is inevitable. However, in practice, we are typically not interested in excluding arbitrary sets of certainty degrees, only in establishing lower bounds on certainty degrees. To study this setting, we introduce a generalisation of the epistemic pooling principle to weighted epistemic states. Let us write ⟨p, i⟩ for the fact that the certainty level of p is at least i, where ⟨p, 0⟩ means that nothing is known about p whereas ⟨p, K⟩ means that p is known with full certainty. We write Λ_0 = {1, 2, ..., K} for the set of non-trivial lower bounds. We furthermore assume that the certainty level of p is determined by the scoring function γ_p : R^n → R. In particular, under the strict semantics, we assume that p is known with certainty at least i, for i ∈ Λ_0, if γ_p(e) > i − 1. We then consider the following generalisation of the epistemic pooling principle:

Γ_Λ(e ⊕ f) = {⟨p, max(i, j)⟩ | ⟨p, i⟩ ∈ Γ_Λ(e), ⟨p, j⟩ ∈ Γ_Λ(f)}    (24)

We define Γ_Λ as a set of weighted properties, encoding for each property what is the highest certainty degree with which this property is believed:

Γ_Λ(e) = {⟨p, i⟩ | p ∈ P, i = max({0} ∪ {j ∈ Λ_0 | γ_p(e) > j − 1})}    (25)

The condition that every weighted epistemic state is modelled by some vector e ∈ X can then be formalised as follows:

∀µ : P → Λ . ∃e ∈ X . Γ_Λ(e) = {⟨p, µ(p)⟩ | p ∈ P}    (26)

In other words, for every assignment of certainty degrees to the properties in P, there exists an embedding e ∈ X that encodes the corresponding epistemic state.
We also introduce the following notations for i ∈ Λ_0:

Pos_{p,i} = {e ∈ X | γ_p(e) > i − 1}    Neg_{p,i} = {e ∈ X | γ_p(e) ≤ i − 1}

Similarly, we can also consider a weighted version of the weak epistemic pooling principle:

Γ̃_Λ(e ⊕ f) = {⟨p, max(i, j)⟩ | ⟨p, i⟩ ∈ Γ̃_Λ(e), ⟨p, j⟩ ∈ Γ̃_Λ(f)}    (27)

The weighted epistemic state associated with an embedding e ∈ X is now defined as follows:

Γ̃_Λ(e) = {⟨p, i⟩ | p ∈ P, i = max({0} ∪ {j ∈ Λ_0 | γ_p(e) ≥ j})}    (28)

while the counterpart of (26) becomes:

∀µ : P → Λ . ∃e ∈ X . Γ̃_Λ(e) = {⟨p, µ(p)⟩ | p ∈ P}    (29)

We now analyse whether the weighted epistemic pooling principles (24) and (27) can be satisfied in a non-trivial way for embeddings with |P| dimensions. We find that this is only the case for ⊕_max, as shown in Table 3.

5.1. Realisability of the weighted epistemic pooling principle
Average. In entirely the same way as in Section 3.1, we find that Pos_{p,i} and Neg_{p,i} are convex, for every p ∈ P and i ∈ Λ_0. Similarly, we also find that Neg_{p,i} ⊆ δ(X) for every p ∈ P and i ∈ Λ_0. Since Pos_{p,i} \ Pos_{p,K} ⊆ Neg_{p,K}, we thus also have that Pos_{p,i} \ Pos_{p,K} ⊆ δ(X). We can now show the following result using a similar strategy as in the proof of Proposition 2.

Proposition 15. Suppose (24) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕_avg and X ⊆ R^n. Suppose that (26) is satisfied. It holds that n ≥ |P| · K.
In Section 3.1, we found that the epistemic pooling principle cannot be satisfied in a non-trivial way under the weak semantics, if continuous scoring functions are used. This result carries over to the weighted setting.
Summation. Entirely similarly to Section 3.2, we find that when the weighted epistemic pooling principle is satisfied for all e, f ∈ X and ⊕ = ⊕_sum, then it is also satisfied for ⊕ = ⊕_avg. It thus follows from Proposition 15 that |P| · K dimensions are needed to satisfy (24) in a non-trivial way. Moreover, we also have that the weighted epistemic pooling principle under the weak semantics, i.e. (27), cannot be satisfied when continuous scoring functions are used.
Max-pooling. For ⊕_max, we can satisfy the weighted epistemic pooling principles (24) and (27) for every e, f ∈ R^n, while ensuring that every weighted epistemic state is encoded by some vector. Let P = {p_1, ..., p_n} and let γ_{p_i} be defined as follows:

γ_{p_i}(e_1, ..., e_n) = e_i

It holds that (24) is satisfied for each e, f ∈ R^n. Indeed, for every j ∈ Λ_0 we have:

γ_{p_i}(e ⊕_max f) > j − 1 ⇔ max(e_i, f_i) > j − 1 ⇔ (γ_{p_i}(e) > j − 1) ∨ (γ_{p_i}(f) > j − 1)

In entirely the same way, we find that the weak weighted epistemic pooling principle (27) is satisfied. We now show that both (26) and (29) are satisfied. Let µ : P → Λ. Then we can define e = (e_1, ..., e_n) as follows:

e_i = µ(p_i)    (30)

It is trivial to verify that Γ_Λ(e) = Γ̃_Λ(e) = {⟨p, µ(p)⟩ | p ∈ P}.
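The max-pooling construction for weighted epistemic states admits a direct numerical check; the sketch below (names are ours) verifies that pooling with max yields, per property, the maximum certainty degree of the two sources.

```python
import random

K = 3  # certainty levels 0..K

def degree(e, i):
    # Highest j in {0, ..., K} such that gamma_{p_i}(e) = e_i > j - 1
    # (strict weighted semantics)
    return max([0] + [j for j in range(1, K + 1) if e[i] > j - 1])

def max_pool(e, f):
    return [max(a, b) for a, b in zip(e, f)]

def check_weighted_pooling(trials=1000, n=4):
    for _ in range(trials):
        e = [random.uniform(0, K) for _ in range(n)]
        f = [random.uniform(0, K) for _ in range(n)]
        g = max_pool(e, f)
        for i in range(n):
            if degree(g, i) != max(degree(e, i), degree(f, i)):
                return False
    return True
```

The check goes through because the degree is a non-decreasing function of the coordinate, so taking the coordinate-wise maximum takes the maximum of the degrees.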
Hadamard product. As in Section 3.4, we find that Pos_{p,ℓ} is a finite union of regions of the form X ∩ ⋂_{i∈I} H_i. For a given index set I ⊆ {1, ..., n}, we define:

P^Λ_I = {⟨p, ℓ⟩ | p ∈ P, ℓ ∈ Λ_0, X ∩ ⋂_{i∈I} H_i ⊆ Pos_{p,ℓ}}

In entirely the same way as in Lemma 9, we find for every I, J ⊆ {1, ..., n} that:

P^Λ_{I∪J} = P^Λ_I ∪ P^Λ_J

The following lemma is also shown in entirely the same way as Lemma 8.

Lemma 14. Suppose (24) is satisfied for all embeddings e, f ∈ X, with ⊕ = ⊕_had. Let p ∈ P and ℓ ∈ Λ_0, and let e = (e_1, ..., e_n) ∈ Pos_{p,ℓ}. Let I = {i ∈ {1, ..., n} | e_i = 0}. It holds that X ∩ ⋂_{i∈I} H_i ⊆ Pos_{p,ℓ}.
Proof. Given (26), for each p ∈ P and ℓ ∈ Λ_0, there must exist some e^{p,ℓ} = (e^{p,ℓ}_1, ..., e^{p,ℓ}_n) ∈ X such that Γ_Λ(e^{p,ℓ}) = {⟨p, ℓ⟩} ∪ {⟨q, 0⟩ | q ∈ P \ {p}}. Let us fix such vectors e^{p,ℓ} for each p ∈ P and ℓ ∈ Λ_0. Define I_{p,ℓ} = {i ∈ {1, ..., n} | e^{p,ℓ}_i = 0}. Note that by Lemma 14, we have X ∩ ⋂_{i∈I_{p,ℓ}} H_i ⊆ Pos_{p,ℓ}. Moreover, by construction, we have X ∩ ⋂_{i∈I_{p,ℓ}} H_i ⊈ Pos_{q,ℓ′} for any q ≠ p and ℓ′ ∈ Λ_0, and similarly, we have X ∩ ⋂_{i∈I_{p,ℓ}} H_i ⊈ Pos_{p,ℓ′} for any ℓ′ > ℓ. In other words, we have that all elements in P^Λ_{I_{p,ℓ}} are of the form ⟨p, ℓ′⟩ with ℓ′ ≤ ℓ. For p ≠ q and ℓ, ℓ′ ∈ Λ_0, we clearly have I_{p,ℓ} ⊈ I_{q,ℓ′}, since I_{p,ℓ} ⊆ I_{q,ℓ′} would imply ⟨p, ℓ⟩ ∈ P^Λ_{I_{q,ℓ′}}. This means in particular that I_{p,ℓ} ≠ ∅ for every p ∈ P and ℓ ∈ Λ_0.
The above limitation also applies to the weak semantics. However, if we choose X = [0, 1]^n it is possible to satisfy (24) for n properties if K = 2 (i.e. if we have three certainty degrees). Indeed, let P = {p_1, ..., p_n}. Then we can define:

γ_{p_i}(e_1, ..., e_n) = 2 if e_i = 0; 1 if 0 < e_i < 1; 0 if e_i = 1    (31)

Then it is straightforward to verify that (24) is indeed satisfied for every e, f ∈ [0, 1]^n and that (26) also holds. The above construction also provides an example of how the weighted epistemic pooling principle can be satisfied for the weak semantics. Indeed, with the above definition of γ_{p_i} we have that (27) is satisfied for every e, f ∈ [0, 1]^n, while (29) also holds. In Section 3.4, we found that the weak epistemic pooling principle could be satisfied with continuous scoring functions when ⊕ = ⊕_had. Unfortunately, this strategy does not allow us to obtain a continuous alternative to the scoring functions defined in (31).
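A step-function scoring of this kind can be checked numerically. The concrete three-level γ below is our illustration of such a discontinuous scoring function on [0, 1]: full certainty when a coordinate is 0, partial certainty when it lies strictly between 0 and 1, and no knowledge when it equals 1.

```python
import random

def gamma(e, i):
    # Discontinuous step-function scoring on [0, 1] (our illustration)
    if e[i] == 0:
        return 2.0
    if e[i] < 1:
        return 1.0
    return 0.0

def degree(e, i, K=2):
    # Highest j with gamma(e, i) > j - 1 (strict weighted semantics)
    return max([0] + [j for j in range(1, K + 1) if gamma(e, i) > j - 1])

def hadamard(e, f):
    return [a * b for a, b in zip(e, f)]

def check(trials=2000, n=3):
    # Pooling with the Hadamard product should take the maximum
    # certainty degree per property
    for _ in range(trials):
        e = [random.choice([0.0, 0.3, 0.8, 1.0]) for _ in range(n)]
        f = [random.choice([0.0, 0.3, 0.8, 1.0]) for _ in range(n)]
        g = hadamard(e, f)
        for i in range(n):
            if degree(g, i) != max(degree(e, i), degree(f, i)):
                return False
    return True
```

The key observation is that on [0, 1] a product is 0 iff a factor is 0, equals 1 iff both factors equal 1, and lies strictly between 0 and 1 otherwise.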
6. Discussion and related work

6.1. Logical reasoning with neural networks
The use of neural networks for simulating symbolic reasoning has been extensively studied under the umbrella of neuro-symbolic reasoning [11]. The seminal KBANN method [12], for instance, uses feedforward networks with carefully chosen weights to simulate the process of reasoning with a given rule base. In this case, the neural network simulates a fixed rule base, which is manually specified. More recent work has investigated how a neural network can be trained to simulate the deductive closure of a given logical theory, e.g. a description logic ontology [13]. The idea that rule-based reasoning can be simulated using neural networks also lies at the basis of various strategies for learning rules from data [14,15,16]. Another recent research line has focused on whether standard neural network architectures, such as LSTMs or transformer-based language models [6], can be trained to recognise logical entailment [17,18]. In these works, the input consists of a premise (or a set of premises) and a hypothesis, and the aim is to predict whether the hypothesis can be inferred from the premise(s). The aforementioned works differ from this paper, as our focus is not on whether neural networks can simulate logical reasoning, but on whether vectors can be used for encoding epistemic states. Simulating logical reasoning does not necessarily require that vectors can encode epistemic states, since we can treat reasoning as an abstract symbol manipulation problem. Moreover, the scope of what we address in this paper goes beyond logical reasoning, as the epistemic pooling principle also matters whenever we need to combine evidence from different sources (e.g. features being detected in different regions of an image).
This paper builds on our earlier work [19], where the focus was on the following question: given a set of attributes A, a pooling function ⊕ and a propositional knowledge base K, can we always find an embedding a and a scoring function γ_a, for every a ∈ A, such that:

γ_b(a_1 ⊕ ... ⊕ a_k) > 0 ⇔ K ∪ {a_1, ..., a_k} |= b

for any a_1, ..., a_k, b ∈ A. This analysis fundamentally differs from the results in this paper because, in the case of [19], we only care about the behaviour of the pooling operator and scoring function for a finite set of vectors, i.e. the attribute embeddings. In the notations of this paper, this amounts to limiting X to a finite set of carefully chosen embeddings. As a result, for instance, in [19] it was possible to use average pooling in combination with continuous scoring functions, under the weak semantics, something which we found to be impossible in the more general setting considered in this paper.

6.2. Reasoning with Graph Neural Networks
In Section 4, we have specifically focused on reasoning in the context of propositional logic. However, our analysis is relevant for reasoning in relational domains as well. A standard approach for relational reasoning with neural networks is to rely on Graph Neural Networks (GNNs) [8,20]. Given a graph G = (V, E), with V a set of nodes and E ⊆ V × V a set of edges, a GNN aims to learn a vector representation of every node in V. This is achieved by incrementally updating the current representation of each node based on the representations of its neighbours. In particular, let us write v^{(i)} for the representation of node v ∈ V in layer i of the GNN. Let {u_1, ..., u_k} be the set of neighbours of v in G. Then the representation of v in layer i + 1 is typically computed as follows:

v^{(i+1)} = f_1(v^{(i)}) ⊕_1 f_2(f_3(u_1^{(i)}) ⊕_2 ... ⊕_2 f_3(u_k^{(i)}))    (32)

The functions f_1, f_2, f_3 and the operators ⊕_1 and ⊕_2 can be defined in various ways. However, regardless of the specifics, we can think of ⊕_1 and ⊕_2 as pooling operators, whereas f_1, f_2 and f_3 correspond to (possibly non-linear) transformations. Note that we used a non-standard notation in (32) to highlight the connection to this paper. Intuitively, we can think of f_3(u_j^{(i)}) as a vector that captures what we can infer about the entity represented by node v from the fact that it is connected to u_j. In multi-relational settings (e.g. knowledge graphs), where different types of edges occur, f_3 can be replaced by a function that depends on the edge type. The pooling operator ⊕_2 is used to aggregate the evidence coming from the different neighbours of v, whereas ⊕_1 is used to combine the evidence we already have about v with the evidence we can obtain from its neighbours.
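The message-passing scheme can be sketched generically as follows; the decomposition into f_1, f_2, f_3 and two pooling operators mirrors the discussion above, and the concrete functions below are placeholders of our own.

```python
def gnn_layer(v, neighbours, f1, f2, f3, pool1, pool2):
    # One message-passing step: f3 maps each neighbour embedding to
    # evidence about v, pool2 aggregates that evidence over neighbours,
    # and pool1 combines the result with v's own (transformed) embedding
    aggregated = f3(neighbours[0])
    for u in neighbours[1:]:
        aggregated = pool2(aggregated, f3(u))
    return pool1(f1(v), f2(aggregated))

def max_pool(a, b):
    return [max(x, y) for x, y in zip(a, b)]

def identity(x):
    return x

def example_update():
    # With identity transformations and max pooling, evidence simply
    # accumulates coordinate-wise across v and its neighbours
    v = [0.0, 1.0]
    neighbours = [[1.0, 0.0], [0.5, 0.5]]
    return gnn_layer(v, neighbours, identity, identity, identity, max_pool, max_pool)
```

Swapping max_pool for a summation or averaging operator recovers the more common GNN aggregators discussed in this paper.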
The ability of GNNs to simulate logical reasoning has been studied in [21]. The idea is that each node is associated with a set of properties, which are referred to as colours. We can then consider first-order formulas involving unary predicates, referring to these colours, and the binary predicate E, which captures whether two nodes are connected. For instance, consider the following formula:

φ_1(x) ≡ Green(x) ∧ (∃y . E(x, y) ∧ Blue(y)) ∧ ¬(∃y . E(x, y) ∧ Red(y))

This formula is true for a given node if it is green and it is connected in the graph to a blue node but not to a red node. We can also consider counting quantifiers, as in the following example:

φ_2(x) ≡ ∃^{≥5} y . E(x, y) ∧ Blue(y)

This formula is true for a given node if it is connected to at least 5 blue nodes. The question studied in [21] is which formulas can be recognised by a GNN, i.e. for which class of formulas φ can we design a GNN such that we can predict whether φ holds for a node v from the final-layer embedding of that node, using some scoring function. In particular, it was shown that the set of formulas that can be recognised (without global read-out) are exactly those that are expressible in graded modal logic [22], which is characterised as follows: 1. for each colour C, the formula C(x) is a graded modal logic formula; 2. if φ(x) and ψ(x) are graded modal logic formulas and n ∈ N, then the following formulas are also graded modal logic formulas: ¬φ(x), φ(x) ∧ ψ(x) and ∃^{≥n} y . E(x, y) ∧ φ(y).
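Checking a counting formula of this kind against a concrete graph is straightforward; the sketch below (names are ours) tests whether a node has at least a given number of blue neighbours.

```python
def satisfies_phi2(node, edges, colour, threshold=5):
    # phi_2(x): x is connected to at least `threshold` blue nodes
    blue_neighbours = [y for (x, y) in edges if x == node and colour.get(y) == "Blue"]
    return len(blue_neighbours) >= threshold

def example():
    # A node v with five blue neighbours and one red neighbour
    edges = [("v", "u%d" % i) for i in range(6)]
    colour = {"u%d" % i: "Blue" for i in range(5)}
    colour["u5"] = "Red"
    return satisfies_phi2("v", edges, colour)
```

The result in [21] concerns when such a check can be carried out by a GNN from node embeddings alone, rather than by direct graph inspection as done here.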
The proof that was provided for the characterisation in [21] is constructive. It relies on the particular choice of ⊕_2 as summation, which appears to be at odds with the limitations that were identified for ⊕_sum in this paper. However, the GNN in their construction only uses binary coordinates. As we have seen in Section 4.4, in that case, ⊕_avg can be used for propositional reasoning, a result which straightforwardly carries over to ⊕_sum. This observation may help to explain the discrepancy between the theoretical ability of GNNs to capture arbitrary formulas from graded modal logic, and the challenges that have empirically been observed when using GNNs for learning to reason. For instance, GNNs have generally failed to outperform simpler embedding-based methods for the task of knowledge graph completion [23], while [24] found that GNNs were limited in their ability to generalise in a systematic way from examples that were more complex than the ones seen during training. Interestingly, [25] recently proposed a GNN for knowledge graph completion in which ⊕_2 corresponds to max-pooling. Together with a number of other design choices (e.g. avoiding negative weights and using a particular encoding of the knowledge graph), this leads to GNNs that are in some sense equivalent to a set of rules. The suitability of max-pooling, in this context, is not a surprise, given our results from Section 4. Note, however, that our results also suggest that coordinates have to be upper-bounded if we want to identify cases where sets of atomic properties are jointly satisfied. In the approach from [25], this issue is avoided by using high-dimensional embeddings in which each candidate inference corresponds to a separate coordinate, which amounts to treating the formulas of interest as atomic properties in our framework.

Modelling relations as regions
A popular strategy for making predictions in relational domains consists in learning (i) an embedding e ∈ R^n for each entity of interest e and (ii) a scoring function f_r : R^n × R^n → R for each relation r, such that f_r(e, f) indicates the probability that the fact r(e, f) holds. Despite the fact that such methods intuitively carry out some form of logical inference, and despite their strong empirical performance [23], for most approaches there is no clear link between the parameters of the model (i.e. the embeddings and the parameters of the scoring functions), on the one hand, and the kinds of inferences that are captured, on the other hand. Region-based methods, however, are a notable exception [26,27,28,29,30]. The central idea of such methods is to represent predicates as regions. For instance, if s is a unary predicate, then the corresponding region R_s is such that e ∈ R_s iff s(e) is true. For a binary predicate r, one option is to use a region R_r in R^2n such that e ⊕ f ∈ R_r iff r(e, f) is true. In other words, we view the concatenation of e and f as the embedding of the tuple (e, f) and model binary predicates as regions over such concatenations. The key advantage of region-based models is that logical relationships can be directly encoded, in terms of spatial relationships between the region-based representations of the predicates involved. As a simple example, the rule r1(x, y) → r2(x, y) is satisfied if R_r1 ⊆ R_r2. We refer to [26] for details on how more complex rules can be similarly captured. This correspondence between logical dependencies and spatial relationships can be used to ensure that the predictions of the model are in accordance with a given knowledge base, or to explain the behaviour of a model in terms of the logical rules it captures. Moreover, region-based embeddings make it possible to query embeddings of knowledge bases in a principled way [28,29]. Essentially, a given query (e.g. "retrieve all companies whose headquarters are in a European capital city") is then mapped onto a region, such that the entities that satisfy the query are those whose embedding belongs to the region.
However, it should be noted that the aforementioned region-based embeddings encode a specific possible world, rather than an epistemic state. In other words, they encode which facts are assumed to be true and false, but they cannot encode incomplete knowledge (e.g. that either r(a, b) or s(a, b) holds). On the other hand, the box embeddings proposed in [31,32] can be used to model some epistemic states. For instance, an approach for embedding uncertain knowledge graphs based on box embeddings was proposed in [33]. In this case, the idea is to represent the entities themselves as regions, and as axis-aligned hyperboxes in particular. The problem of pooling such hyperbox representations has not yet been considered, to the best of our knowledge. The most intuitive approach would be to simply take the intersection of the boxes, i.e. if entity e is represented by a hyperbox B1, according to one source, and by a hyperbox B2, according to another source, we may want to use B1 ∩ B2 as an aggregate representation of entity e, reflecting the information provided by both sources. However, this leads to a number of practical challenges. For instance, if box embeddings are used to parameterise a probabilistic model, as in [33], then it is unclear whether a sound justification can be provided for a pooling operation that relies on intersecting the entity-level box representations. Such probabilistic models also serve a rather different purpose from the framework that we studied in this paper, which is about accumulating knowledge rather than about quantifying uncertainty. Even in settings where the box embeddings are used as a purely qualitative representation, a problem arises when the region B1 ∩ B2 is empty. Intuitively, box embeddings act as constraints on possible worlds, and such constraints can be inconsistent. This is different from the settings we studied in this paper, which were about accumulating knowledge, formalised as sets of properties.
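The intersection-based pooling of axis-aligned hyperboxes, and the emptiness problem it raises, can be made concrete in a short sketch. The two boxes below are illustrative; each is given as a (lower corner, upper corner) pair, and the intersection of axis-aligned boxes is computed coordinate-wise.

```python
import numpy as np

# Two hypothetical hyperbox representations of the same entity,
# obtained from two different sources.
B1 = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))
B2 = (np.array([0.5, -1.0]), np.array([2.0, 0.5]))

def intersect(a, b):
    """Coordinate-wise intersection of two axis-aligned boxes."""
    lo = np.maximum(a[0], b[0])
    hi = np.minimum(a[1], b[1])
    return lo, hi

lo, hi = intersect(B1, B2)

# The intersection is empty iff the lower corner exceeds the upper
# corner in some coordinate; here the two sources are still consistent.
empty = bool(np.any(lo > hi))
print(lo, hi, empty)
```

When `empty` is true, the pooled representation no longer corresponds to any possible world, which is precisely the failure mode that has no counterpart in the property-set framework studied in this paper.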

Conclusions
Neural networks are often implicitly assumed to perform some kind of reasoning. In this paper, we have particularly focused on the common situation where evidence is obtained from different sources, which then needs to be combined. The core question we addressed is whether it is possible to represent the evidence obtained from each source as a vector, such that pooling these vectors amounts to combining the corresponding evidence, a requirement we refer to as the epistemic pooling principle. This question is important for understanding whether, or under which conditions, neural networks that rely on pooling are able to perform reasoning in a principled way. Our analysis shows that standard pooling operators can indeed be used for accumulating evidence, but only under particular conditions. Broadly speaking, the requirement that the epistemic pooling principle is satisfied substantially limits how knowledge can be encoded. For instance, when average pooling is used, we find that embeddings have to be limited to a strict subset X of R^n, and that vectors which encode that some property p is not satisfied have to be located on a bounding hyperplane of X. We also highlighted how such conditions limit the way in which embeddings can be used. For instance, when average pooling is used, it is not possible to use linear scoring functions for checking whether a given propositional formula is satisfied in the epistemic state encoded by a given vector. In general, our results provide valuable insights for the design of neural networks that are required to implement some form of systematic reasoning.