Analyzing Differentiable Fuzzy Logic Operators

The AI community is increasingly turning its attention towards combining symbolic and neural approaches, as it is often argued that the strengths and weaknesses of these approaches are complementary. One recent trend in the literature is the use of weakly supervised learning techniques that employ operators from fuzzy logics. In particular, these techniques use prior background knowledge described in such logics to help the training of a neural network from unlabeled and noisy data. By interpreting logical symbols using neural networks, this background knowledge can be added to regular loss functions, hence making reasoning a part of learning. We study, both formally and empirically, how a large collection of logical operators from the fuzzy logic literature behaves in a differentiable learning setting. We find that many of these operators, including some of the most well-known, are highly unsuitable in this setting. A further finding concerns the treatment of implication in these fuzzy logics, and shows a strong imbalance between gradients driven by the antecedent and the consequent of the implication. Furthermore, we introduce a new family of fuzzy implications (called sigmoidal implications) to tackle this phenomenon. Finally, we empirically show that it is possible to use Differentiable Fuzzy Logics for semi-supervised learning, and compare how different operators behave in practice. We find that, to achieve the largest performance improvement over a supervised baseline, we have to resort to non-standard combinations of logical operators which perform well in learning, but no longer satisfy the usual logical laws.


Introduction
In recent years, much work has been published on the integration of symbolic and statistical approaches to Artificial Intelligence (AI) (Garcez et al., 2012; Besold et al., 2017). This development is partly inspired by critiques of deep learning (Marcus, 2018; Pearl, 2018), the statistical method that has been the focus of the AI community in the last decade. While deep learning has brought many important breakthroughs in computer vision (Brock et al., 2018), natural language processing (Devlin et al., 2018) and reinforcement learning (Silver et al., 2017), the concern is that progress will be halted if its shortcomings are not dealt with. Among these is the massive amount of data that deep learning needs to effectively learn a concept. On the other hand, symbolic AI can reuse concepts and knowledge using only a small amount of data (e.g. a single logical statement). Additionally, it is easier to interpret the decisions of symbolic AI, as the explicit symbols refer to concepts that have a clear meaning to humans, while deep learning uses complex mathematical models with millions or billions of numerical parameters. Finally, it is much easier to describe background knowledge using symbolic AI and to integrate it into such a system.
A major downside of symbolic AI is that it is unable to capture the noisiness and ambiguity of sensory data. It is difficult to precisely express how small changes in the input data should produce different outputs. This is related to the symbol grounding problem, which Harnad (1990) defines as how "the semantic interpretation of a formal symbol system can be made intrinsic to the system, rather than just parasitic on the meanings in our heads". Symbols refer to concepts that have an intrinsic meaning to us humans, but computers manipulating these symbols cannot trivially understand this meaning. In contrast to symbolic AI, a properly trained deep learning model excels at modeling complex sensory data. These models could bridge the gap between symbolic systems and the real world. Therefore, several recent approaches (Diligenti et al., 2017b; Garnelo et al., 2016; Serafini and Garcez, 2016; Manhaeve et al., 2018; Evans and Grefenstette, 2018) aim to interpret symbols that are used in logic-based systems using deep learning models. These are some of the first systems to implement a proposition going back 20 years, from Harnad (1990), namely "a hybrid nonsymbolic/symbolic system (...) in which the elementary symbols are grounded in (...) non-symbolic representations that pick out, from their proximal sensory projections, the distal object categories to which the elementary symbols refer."

Reasoning and Learning using Gradient Descent
We introduce Differentiable Fuzzy Logics (DFL). DFL integrates reasoning and learning by using logical formulas which express background knowledge. The symbols in these formulas are interpreted using a deep learning model of which the parameters are to be learned. DFL constructs differentiable loss functions based on these formulas that can be minimized using gradient descent. This ensures that the deep learning model acts in a manner that is consistent with the background knowledge as we can backpropagate towards the deep learning model parameters.
In order to ensure loss functions are differentiable, DFL uses fuzzy logic semantics (Klir and Yuan, 1995). Predicate, function and constant symbols are interpreted using the deep learning model. By maximizing the degree of truth of the background knowledge using gradient descent, both learning and reasoning are performed in parallel.
By adding the loss function of DFL to other loss functions commonly used in deep learning, DFL can be used for more challenging machine learning tasks than purely supervised learning. These methods fall under the umbrella of weakly supervised learning (Zhou, 2017). For example, it becomes possible to detect noisy or inaccurate supervision by correcting inconsistencies between the labels, the model's predictions and the background knowledge (Donadello et al., 2017). A promising application is semi-supervised learning, in which only a limited fraction of the dataset is labeled and a large part is unlabeled (Xu et al., 2018; Hu et al., 2016). This is done by correcting the predictions of the deep learning model when they are logically inconsistent.
In this paper, we present an analysis of the choice of operators used to compute the logical connectives in DFL. For example, functions called t-norms are used to connect two fuzzy propositions (Klir and Yuan, 1995). Because they return the degree of truth of the event that both propositions are true, t-norms generalize the Boolean conjunction. Similarly, a fuzzy implication generalizes the Boolean implication. Most of these operators are differentiable, allowing them to be used in DFL. Interestingly, the derivatives of these operators determine how DFL corrects the deep learning model when its predictions are inconsistent with the background knowledge. We will show that the qualitative properties of these derivatives are integral to both the theory and practice of DFL.

Contributions
The main question that we aim to answer in this work is: "Which fuzzy logic operators for aggregation, conjunction, disjunction and implication have convenient theoretical properties when using them in gradient descent?" We analyze both theoretically and empirically the effect of the choice of operators used to compute the logical connectives in Differentiable Fuzzy Logics on the learning behaviour of a DFL system. To this end,
• we introduce several known operators from fuzzy logic (Section 3) and the framework of Differentiable Fuzzy Logics (Section 4) that uses these operators;
• we analyze the theoretical properties of four types of operators (Section 5): aggregation functions, which are used to compute the universal quantifier ∀; conjunction and disjunction operators, which are used to compute the connectives ∧ and ∨; and fuzzy implications, which are used to compute the connective →;
• we perform experiments to compare these fuzzy logic operators in a semi-supervised experiment (Section 9);
• we conclude with several recommendations for choices of operators.

Differentiable Logics
Differentiable Logics (DL) are logics for which differentiable loss functions can be constructed that represent logical formulas. These logics use background knowledge to deduce the truth value of statements in unlabeled or poorly labeled data. This allows us to use such data during learning, possibly together with normal labeled data. This can be beneficial as unlabeled, poorly labeled and partially labeled data is cheaper and easier to come by. Importantly, this approach differs from Inductive Logic Programming (Muggleton and de Raedt, 1994) which derives rules from data. DL is the other way around: the logic informs what the truth values of the statements could have been.
We motivate the use of Differentiable Logics with the following scenario: Assume we have an agent A whose goal is to describe the scene on an image. It gets feedback from a supervisor S, who does not have an exact description of these images available. However, S does have a background knowledge base K, encoded in some logical formalism, about the concepts contained on the images. The intuition behind Differentiable Logics is that S can correct A's descriptions of scenes when they are not consistent with its knowledge base K.
Example 1. We illustrate this idea with the following example. Agent A has to describe the image $I$ in the figure. Suppose that K contains the following logic formula, which says that objects that are a part of a chair are either cushions or armrests:
$$\forall x, y\ \ \mathsf{chair}(x) \wedge \mathsf{partOf}(y, x) \rightarrow \mathsf{cushion}(y) \vee \mathsf{armRest}(y).$$
S might now reason that, since A is relatively confident of $\mathsf{chair}(o_1)$ and $\mathsf{partOf}(o_2, o_1)$, the antecedent of this formula is satisfied, and thus $\mathsf{cushion}(o_2)$ or $\mathsf{armRest}(o_2)$ has to hold. Since $p(\mathsf{cushion}(o_2) \mid I, o_2) > p(\mathsf{armRest}(o_2) \mid I, o_2)$, a possible correction would be to tell A to increase its degree of belief in $\mathsf{cushion}(o_2)$.
We would like to automate the kind of supervision S performs in the previous example. To this end, we identify a significant family of Differentiable Logics in the literature that are based on fuzzy logic: We call it Differentiable Fuzzy Logics (DFL). Examples of logics in this family are Real Logic (Serafini and Garcez, 2016), the rather similarly named Deep Fuzzy Logic (Marra et al., 2019b), and the logics underlying Semantic Based Regularization (Diligenti et al., 2017b), LYRICS (Marra et al., 2018) and KALE (Guo et al., 2016). We compare these logics in Section 10.1. The objective of DFL is to maximize the satisfaction of the full grounding of a fuzzy knowledge base. To this end, truth values of ground atoms are not discrete but continuous, and logical connectives are interpreted using some function over these truth values.

Logic
We assume that the basic syntax and semantics of first-order logic are familiar. We will denote predicates using the sans serif font, for example $\mathsf{cushion}$, variables by $x, y, z, x_1, \dots$ and objects by $o_1, o_2, \dots$. For convenience, we will be limiting ourselves to function-free formulas in prenex normal form. Formulas in prenex normal form start with quantifiers followed by a quantifier-free subformula. An example of a formula in prenex form is $\forall x, y\ \mathsf{P}(x, y) \wedge \mathsf{Q}(x) \rightarrow \mathsf{R}(y)$. An atom is $\mathsf{P}(t_1, \dots, t_m)$ where $t_1, \dots, t_m$ are terms. If $t_1, \dots, t_m$ are all constants, we say it is a ground atom.
Fuzzy logic is a real-valued logic where truth values are real numbers in [0,1] where 0 denotes completely false and 1 denotes completely true. Fuzzy logic models the concept of vagueness by arguing that the truth value of many propositions can be noisy to measure, or subjective. We will be looking at predicate fuzzy logics in particular. Predicate fuzzy logics extend propositional fuzzy logics with universal and existential quantification.

Operators for Conjunction, Disjunction and Aggregation
In this section, we will introduce the semantics of the fuzzy operators ∧ (conjunction), ∨ (disjunction) and ¬ (negation) that are used to connect truth values of fuzzy predicates, and the semantics of the ∀ quantifier. We follow Jayaram and Baczynski (2008) in this section and refer to it for proofs and additional results.
A function $f : D \to \mathbb{R}$ is said to be left-continuous if for all arbitrarily small $\epsilon > 0$ there exists another value $\delta > 0$ such that for all $a \in D$ it holds that $|f(x) - f(a)| < \epsilon$ whenever $a - \delta < x < a$, and similarly for right-continuous; increasing if for all $a, b \in D$, if $a \leq b$ then $f(a) \leq f(b)$, and similarly for decreasing; and strictly increasing if for all $a, b \in D$, if $a < b$ then $f(a) < f(b)$, and similarly for strictly decreasing.
Left-continuity informally means that when a point is approached from the left, no 'jumps' will occur.
Moreover, a function f :

Fuzzy Negation
The functions that are used to compute the negation of a truth value of a formula are called fuzzy negations.
In this paper we will exclusively use the strict and strong classic negation $N_C(a) = 1 - a$.

Triangular Norms
The functions that are used to compute the conjunction of two truth values are called t-norms. For a rigorous overview, see Klement et al. (2013). The phrase '$T(a, \cdot)$ is increasing' means that whenever $0 \leq b_1 \leq b_2 \leq 1$, then $T(a, b_1) \leq T(a, b_2)$.
Definition 4. A t-norm $T$ can have the following properties: a) Continuity: A continuous t-norm is continuous in both arguments. b) Left-continuity: A left-continuous t-norm is left-continuous in both arguments. c) Idempotency: An idempotent t-norm has the property that for all $a \in [0, 1]$, $T(a, a) = a$. d) Strict monotonicity: A strictly monotone t-norm has the property that for all $a \in [0, 1]$, $T(a, \cdot)$ is strictly increasing. e) Strictness: A strict t-norm is continuous and strictly monotone.
Table 1 shows the four basic t-norms and two other t-norms of interest alongside their properties.

Triangular Conorms
The functions that are used to compute the disjunction of two truth values are called t-conorms or s-norms. T-conorms are obtained from t-norms using De Morgan's laws from classical logic, i.e. $p \vee q = \neg(\neg p \wedge \neg q)$. Therefore, if $T$ is a t-norm and $N_C$ the classical negation, $T$'s $N_C$-dual $S$ is calculated using
$$S(a, b) = 1 - T(1 - a, 1 - b). \qquad (1)$$
Table 2 shows several common t-conorms derived using Equation 1 and the t-norms from Table 1, alongside the same optional properties as those for t-norms in Definition 4.
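The duality in Equation 1 is mechanical enough to sketch in a few lines. The following is a minimal illustration, not the implementation used in the paper; the function names (`t_godel`, `dual_conorm`, and so on) are chosen here for readability and are assumptions of this sketch.

```python
# Three standard t-norms on [0, 1].
def t_godel(a, b):
    return min(a, b)


def t_product(a, b):
    return a * b


def t_lukasiewicz(a, b):
    return max(a + b - 1.0, 0.0)


def n_classic(a):
    """Classical negation N_C(a) = 1 - a."""
    return 1.0 - a


def dual_conorm(t_norm):
    """Build the N_C-dual t-conorm S(a, b) = 1 - T(1 - a, 1 - b)."""
    def s(a, b):
        return n_classic(t_norm(n_classic(a), n_classic(b)))
    return s


s_godel = dual_conorm(t_godel)              # behaves like max(a, b)
s_product = dual_conorm(t_product)          # behaves like a + b - a * b
s_lukasiewicz = dual_conorm(t_lukasiewicz)  # behaves like min(a + b, 1)
```

For instance, `s_product(0.3, 0.8)` evaluates to approximately 0.86, which matches the closed form a + b − ab.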

Aggregation operators
The functions that are used to compute quantifiers like ∀ and ∃ are aggregation operators (Liu and Kerre, 1998).
Definition 6. An aggregation operator is a function $A : [0, 1]^n \to [0, 1]$ that is symmetric and increasing with respect to each argument, and for which $A(0, \dots, 0) = 0$ and $A(1, \dots, 1) = 1$. A symmetric function is one in which the output value is the same for every ordering of its arguments.
Aggregation operators are variadic functions, that is, functions that are defined for any finite number of arguments.
For this reason we will often use the notation $A_{i=1}^{n} x_i := A(x_1, \dots, x_n)$. Table 3 shows some common aggregation operators that we will talk about.
The ∀ quantifier is interpreted as the conjunction over all arguments $x$. Therefore, we can extend a t-norm $T$ from 2-dimensional inputs to n-dimensional inputs, as t-norms are commutative and associative (Klement et al., 2013):
$$A_T() = 1, \qquad A_T(x_1, x_2, \dots, x_n) = T(x_1, A_T(x_2, \dots, x_n)).$$
These operators are a straightforward choice for modelling the ∀ quantifier, as they can be seen as a series of conjunctions. We can do the same for a t-conorm $S$ to model the ∃ quantifier:
$$A_S() = 0, \qquad A_S(x_1, x_2, \dots, x_n) = S(x_1, A_S(x_2, \dots, x_n)).$$
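Because t-norms and t-conorms are associative and commutative, extending them to n inputs amounts to a fold. The sketch below assumes the neutral elements 1 (for t-norms) and 0 (for t-conorms) as starting values; the helper names `forall` and `exists` are illustrative only.

```python
from functools import reduce


def t_product(a, b):
    return a * b


def t_lukasiewicz(a, b):
    return max(a + b - 1.0, 0.0)


def s_product(a, b):
    return a + b - a * b


def forall(t_norm, truth_values):
    """Fold a binary t-norm over the inputs, starting from its neutral element 1."""
    return reduce(t_norm, truth_values, 1.0)


def exists(t_conorm, truth_values):
    """Fold a binary t-conorm over the inputs, starting from its neutral element 0."""
    return reduce(t_conorm, truth_values, 0.0)


values = [0.9, 0.8, 0.5]
all_product = forall(t_product, values)          # 0.9 * 0.8 * 0.5
all_lukasiewicz = forall(t_lukasiewicz, values)  # max(sum(values) - 2, 0)
some_product = exists(s_product, values)         # 1 - 0.1 * 0.2 * 0.5
```

Folding from the neutral element also gives a sensible value for an empty list of inputs: `forall` returns 1 and `exists` returns 0.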

Fuzzy Implications
The functions that are used to compute the truth value of p → q are called fuzzy implications. p is called the antecedent and q the consequent of the implication. We follow Jayaram and Baczynski (2008) and refer to it for details and proofs.

S-Implications
In classical logic, the (material) implication is defined as follows:
$$p \rightarrow q = \neg p \vee q.$$
Using this definition, we can use a t-conorm $S$ and a fuzzy negation $N$ to construct a fuzzy implication.
Definition 9. Let $S$ be a t-conorm and $N$ a fuzzy negation. The function $I_{S,N} : [0, 1]^2 \to [0, 1]$ is called an (S, N)-implication and is defined for all $a, c \in [0, 1]$ as
$$I_{S,N}(a, c) = S(N(a), c).$$
If $N$ is a strong fuzzy negation, then $I_{S,N}$ is called an S-implication (or strong implication).
As we only consider the classical negation $N_C$, we omit the $N$ and use $I_S$ to refer to $I_{S,N_C}$. All S-implications $I_S$ are fuzzy implications and satisfy LN, EP and R-CP. Additionally, if the negation $N$ is strong, it satisfies CP and if, in addition, it is strict, it also satisfies L-CP. In Table 4 we show several S-implications that use the strong fuzzy negation $N_C$ and the common t-conorms (Table 2). Note that S-implications are rotations of the t-conorms.
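As a small illustration of the construction $I_S(a, c) = S(1 - a, c)$, the sketch below builds three S-implications from three t-conorms. The names Kleene-Dienes and Reichenbach are the usual names in the fuzzy logic literature for the implications obtained from the maximum and the product t-conorm; the Python identifiers themselves are assumptions of this sketch.

```python
def s_max(a, b):
    return max(a, b)


def s_product(a, b):
    return a + b - a * b


def s_lukasiewicz(a, b):
    return min(a + b, 1.0)


def s_implication(t_conorm):
    """Build I_S(a, c) = S(1 - a, c) using the classical negation."""
    def implication(a, c):
        return t_conorm(1.0 - a, c)
    return implication


i_kleene_dienes = s_implication(s_max)      # max(1 - a, c)
i_reichenbach = s_implication(s_product)    # 1 - a + a * c
i_lukasiewicz = s_implication(s_lukasiewicz)  # min(1 - a + c, 1)
```

On the Boolean corner points all three agree with the classical truth table: only a true antecedent with a false consequent, `(1, 0)`, yields 0.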

R-Implications
R-implications are another way of constructing implication operators. They are the standard choice in t-norm fuzzy logics.
Definition 10. Let $T$ be a t-norm. The function $I_T : [0, 1]^2 \to [0, 1]$ is called an R-implication and is defined for all $a, c \in [0, 1]$ as
$$I_T(a, c) = \sup\{b \in [0, 1] \mid T(a, b) \leq c\}. \qquad (5)$$
The supremum of a set $A$, denoted $\sup\{A\}$, is the least upper bound of $A$. All R-implications are fuzzy implications, and all satisfy LN, IP and EP. $T$ is a left-continuous t-norm if and only if the supremum can be replaced with the maximum function. Note that if $a \leq c$ then $I_T(a, c) = 1$. We can see this by looking at Equation 5. The largest value for $b$ possible is 1, since then, using the neutrality property of t-norms, $T(a, 1) = a \leq c$. Table 5 shows the four R-implications created from the four common t-norms. Note that $I_{LK}$ and $I_{FD}$ appear in both tables: they are both S-implications and R-implications.
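The supremum-based definition of an R-implication can be checked numerically. The sketch below approximates $\sup\{b \mid T(a, b) \leq c\}$ on a uniform grid and compares it with the well-known closed forms for the Gödel, product (Goguen) and Lukasiewicz t-norms. The grid search is only an illustration under the assumption that $T(a, \cdot)$ is increasing; it is not how one would compute these operators in practice.

```python
def t_godel(a, b):
    return min(a, b)


def t_product(a, b):
    return a * b


def t_lukasiewicz(a, b):
    return max(a + b - 1.0, 0.0)


def r_implication(t_norm, a, c, steps=10_000):
    """Approximate sup{b in [0, 1] : T(a, b) <= c} on a uniform grid."""
    best = 0.0
    for k in range(steps + 1):
        b = k / steps
        if t_norm(a, b) <= c:
            best = b
    return best


# Closed forms from the fuzzy logic literature, for comparison.
def i_godel(a, c):
    return 1.0 if a <= c else c


def i_goguen(a, c):
    return 1.0 if a <= c else c / a


def i_lukasiewicz(a, c):
    return min(1.0 - a + c, 1.0)
```

With `a = 0.8` and `c = 0.4`, the grid approximation returns roughly 0.4, 0.5 and 0.6 for the three t-norms, in line with the closed forms; whenever `a <= c` it returns 1.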

Differentiable Fuzzy Logics
We next discuss Differentiable Fuzzy Logics (DFL), a family of Differentiable Logics based on fuzzy logic. Truth values of ground atoms are continuous, and logical connectives are interpreted using fuzzy operators. In principle, DFL can handle both predicates and functions. To ease the discussion, we will not analyze functions, constants and existential quantifiers and thus leave them out of the discussion.¹ We follow both Real Logic in Serafini and Garcez (2016) and the embedding-based semantics in Guha (2015) as an introduction.

Semantics
DFL defines a new semantics using vector embeddings and functions on such vectors in place of classical semantics. In classical logic, a structure consists of a domain of discourse and an interpretation function, and is used to give meaning to the predicates. Similarly, in DFL a structure consists of a probability distribution defined on an embedding space and an embedded interpretation:²
Definition 11. A Differentiable Fuzzy Logics structure is a tuple $\langle p, \eta_\theta \rangle$, where $p$ is a domain distribution over $d$-dimensional³ objects $o \in O$, and $\eta_\theta$ is an embedded interpretation, which is a function parameterized by $\theta$ so that for all predicate symbols $\mathsf{P} \in \mathcal{P}$ with arity $\alpha$, $\eta_\theta(\mathsf{P}) : O^\alpha \to [0, 1]$.
To address the symbol grounding problem (Harnad, 1990), objects in DFL semantics are d-dimensional vectors of reals. Their semantics come from the underlying semantics of the vector space, as terms are interpreted in a real (valued) world (Serafini and Garcez, 2016). Likewise, predicates are interpreted as functions mapping these vectors to a fuzzy truth value. The domain distribution is used to limit the size of the vector space. For example, $p$ might be the distribution over images representing only the natural images. Embedded interpretations can be implemented using any deep learning model.⁴ Different values of the trainable parameters $\theta$ will produce different interpretations $\eta_\theta$, and so we include $\theta$ in the notation.
¹ Existential quantification can be modeled in a similar way to universal quantification, but by using operators generalizing t-conorms instead (similar to how t-norms are used to model universal quantification). Furthermore, functions and constants are modelled in Serafini and Garcez (2016) and Marra et al. (2018).
² Serafini and Garcez (2016) use the term "(semantic) grounding" or "symbol grounding" (Mayo, 2003) instead of 'embedded interpretation', "to emphasize the fact that L is interpreted in a 'real world'", but we find this potentially confusing as this could also refer to groundings in Herbrand semantics. Furthermore, by using the word 'interpretation' we highlight the parallel with classical logical interpretations.
³ Without loss of generality we fix the dimensionality of the vectors representing the objects. Extensions to a varying number of dimensions are straightforward by introducing types.
Next, we define how to compute the truth value of sentences of DFL.
Definition 12. A variable assignment µ maps variable symbols x to objects o ∈ O. µ(x) retrieves the object o ∈ O assigned to x in µ.
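Definition 13. Let $\langle p, \eta_\theta \rangle$ be a DFL structure, $N$ a fuzzy negation, $T$ a t-norm, $S$ a t-conorm, $I$ a fuzzy implication and $A$ an aggregation operator. Then the valuation function $e_{\eta_\theta, p, N, T, S, I, A}$ (or, for brevity, $e_\theta$) computes the truth value of a formula $\varphi$ in L given a variable assignment $\mu$. It is defined inductively as follows:
$$e_\theta(\mathsf{P}(x_1, \dots, x_m)) = \eta_\theta(\mathsf{P})\big(l(x_1, \dots, x_m)\big) \qquad (6)$$
$$e_\theta(\neg \phi) = N(e_\theta(\phi)) \qquad (7)$$
$$e_\theta(\phi \wedge \psi) = T(e_\theta(\phi), e_\theta(\psi)) \qquad (8)$$
$$e_\theta(\phi \vee \psi) = S(e_\theta(\phi), e_\theta(\psi)) \qquad (9)$$
$$e_\theta(\phi \rightarrow \psi) = I(e_\theta(\phi), e_\theta(\psi)) \qquad (10)$$
$$e_\theta(\forall x\ \phi) = A_{o \in O}\ e_\theta(\phi), \text{ with } o \text{ assigned to } x \text{ in } \mu \qquad (11)$$
Equation 6 defines the fuzzy truth value of an atomic formula. $l$ finds the objects assigned to the terms $x_1, \dots, x_m$ in $\mu$, that is, $l(x_1, \dots, x_m) = (\mu(x_1), \dots, \mu(x_m))$, resulting in a list of d-dimensional vectors. These are the inputs to the interpretation of the predicate symbol $\eta_\theta(\mathsf{P})$ to get a fuzzy truth value. Equations 7-10 define the truth values of the connectives using the operators $N$, $T$, $S$ and $I$. Finally, Equation 11 defines the truth value of universally quantified formulas $\forall x\ \phi$. This is done by enumerating the domain of discourse $o \in O$, computing the truth value of $\phi$ with $o$ assigned to $x$ in $\mu$, and combining the truth values using an aggregation operator $A$.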
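The inductive definition of the valuation function maps naturally onto a small recursive evaluator. The sketch below is a toy version under several assumptions that are not in the text: formulas are encoded as nested tuples, objects are string identifiers rather than d-dimensional vectors, and the embedded interpretation is a pair of lookup tables with made-up truth values standing in for a deep learning model.

```python
from functools import reduce

# Product-based operators: N_C, product t-norm, product t-conorm,
# Reichenbach implication and the product aggregator.
PRODUCT_OPS = {
    "N": lambda a: 1.0 - a,
    "T": lambda a, b: a * b,
    "S": lambda a, b: a + b - a * b,
    "I": lambda a, c: 1.0 - a + a * c,
    "A": lambda values: reduce(lambda a, b: a * b, values, 1.0),
}


def valuate(formula, interp, domain, ops, mu=None):
    """Recursively compute the fuzzy truth value of a formula.

    `formula` is a nested tuple (a hypothetical encoding), `interp` maps
    predicate names to functions returning values in [0, 1], `domain` is the
    list of objects and `mu` is the current variable assignment.
    """
    mu = {} if mu is None else mu
    kind = formula[0]
    if kind == "atom":
        _, predicate, variables = formula
        return interp[predicate](*[mu[v] for v in variables])
    if kind == "not":
        return ops["N"](valuate(formula[1], interp, domain, ops, mu))
    if kind in ("and", "or", "implies"):
        left = valuate(formula[1], interp, domain, ops, mu)
        right = valuate(formula[2], interp, domain, ops, mu)
        return ops[{"and": "T", "or": "S", "implies": "I"}[kind]](left, right)
    if kind == "forall":
        _, variable, body = formula
        return ops["A"]([
            valuate(body, interp, domain, ops, {**mu, variable: o})
            for o in domain
        ])
    raise ValueError(f"unknown formula type: {kind!r}")


# Hypothetical truth values for two objects.
RAVEN = {"o1": 0.9, "o2": 0.2}
BLACK = {"o1": 0.8, "o2": 0.5}
interp = {"raven": lambda o: RAVEN[o], "black": lambda o: BLACK[o]}

rule = ("forall", "x",
        ("implies", ("atom", "raven", ("x",)), ("atom", "black", ("x",))))
truth = valuate(rule, interp, ["o1", "o2"], PRODUCT_OPS)
```

Here `truth` is the product of the two Reichenbach implications, (1 − 0.9 + 0.9·0.8)·(1 − 0.2 + 0.2·0.5) ≈ 0.738. Swapping the entries of the operator dictionary changes the semantics without touching the evaluator.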

Relaxing Quantifiers
For infinite domains, or for domains that are so large that we cannot compute the full semantics of the ∀ quantifier, we can choose to sample a batch of $b$ objects from $O$ to approximate the computation of the valuation. This can be done by replacing Equation 11 with
$$e_\theta(\forall x\ \phi) = A_{i=1}^{b}\ e_\theta(\phi), \text{ with } o_i \text{ assigned to } x \text{ in } \mu, \text{ where } o_1, \dots, o_b \text{ are sampled from } O. \qquad (12)$$
An obvious way would be to sample from the domain distribution $p$, if it is available. It is commonly assumed in machine learning (Goodfellow et al., 2016, p. 109) that a dataset $D$ contains independent samples from the domain distribution $p$, and thus using such samples approximates sampling from $p$. Unfortunately, by relaxing quantifiers in this way we lose soundness of the logic.
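The loss of soundness can be made concrete with a small sketch. It assumes a dataset of objects that are simply their own truth values and uses the minimum as aggregation operator; both choices, as well as the helper names, are illustrative only. A sampled batch can miss the object that falsifies the formula, so the batch estimate can only be as low as, or higher than, the full valuation.

```python
import random


def forall_full(truth_fn, domain, aggregate):
    """Aggregate the truth values over the whole domain."""
    return aggregate([truth_fn(o) for o in domain])


def forall_sampled(truth_fn, dataset, batch_size, aggregate, rng):
    """Approximate the universal quantifier on a random batch of objects."""
    batch = rng.sample(dataset, batch_size)
    return aggregate([truth_fn(o) for o in batch])


rng = random.Random(0)
dataset = [rng.random() for _ in range(1000)]  # hypothetical truth values


def identity(o):
    return o


full = forall_full(identity, dataset, min)
approx = forall_sampled(identity, dataset, 32, min, rng)
```

With the minimum aggregator, `approx >= full` always holds, and equality is only guaranteed when the batch covers the whole dataset.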

Learning using Fuzzy Maximum Satisfiability
In DFL, the parameters θ are learned using fuzzy maximum satisfiability (Donadello et al., 2017), which finds parameters that maximize the valuation of the knowledge base K.
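Definition 14. Let $K$ be a knowledge base of formulas, $\langle p, \eta_\theta \rangle$ a DFL structure for the predicate symbols in $K$ and $e_{\eta_\theta, p, N, T, S, I, A}$ a valuation function. Then the Differentiable Fuzzy Logics loss $\mathcal{L}_{DFL}$ of a knowledge base of formulas $K$ is computed using
$$\mathcal{L}_{DFL}(\theta; O, K) = -\sum_{\varphi \in K} w_\varphi \cdot e_\theta(\varphi), \qquad (13)$$
where $w_\varphi$ is the weight for formula $\varphi$, which denotes the importance of the formula $\varphi$ in the loss function. The fuzzy maximum satisfiability problem is the problem of finding parameters $\theta^*$ that minimize Equation 13:
$$\theta^* = \arg\min_\theta\ \mathcal{L}_{DFL}(\theta; O, K). \qquad (14)$$
This optimization problem can be solved using a gradient descent-like method. If the operators $N$, $T$, $S$, $I$ and $A$ are all differentiable, we can repeatedly apply the chain rule, i.e. reverse-mode differentiation, on the DFL loss $\mathcal{L}_{DFL}(\theta_n; O, K)$, $n = 0, \dots, N$. This procedure finds the derivative with respect to the truth values of the ground atoms, $\frac{\partial \mathcal{L}_{DFL}(\theta_n; O, K)}{\partial \eta_{\theta_n}(\mathsf{P})(o_1, \dots, o_m)}$. We can use these partial derivatives to update the parameters $\theta_n$, resulting in a different embedded interpretation $\eta_{\theta_{n+1}}$. This procedure is computed as follows:
$$\theta_{n+1} = \theta_n - \epsilon \cdot \frac{\partial \mathcal{L}_{DFL}(\theta_n; O, K)}{\partial \theta_n}, \qquad (15)$$
where $\epsilon$ is the learning rate. We refer to Appendix A for implementation details.
Example 2. To illustrate the computation of the valuation function $e_\theta$, we return to the problem in Example 1. The domain of discourse is the set of subimages of natural images. The domain distribution is a distribution over those subimages. We take $\{o_1, o_2\}$ as the batch for the aggregation operator. The valuation of the formula $\varphi = \forall x, y\ \mathsf{chair}(x) \wedge \mathsf{partOf}(y, x) \rightarrow \mathsf{cushion}(y) \vee \mathsf{armRest}(y)$ is computed as:
$$e_\theta(\varphi) = A_{x \in \{o_1, o_2\}}\ A_{y \in \{o_1, o_2\}}\ I\Big(T\big(\eta_\theta(\mathsf{chair})(x),\ \eta_\theta(\mathsf{partOf})(y, x)\big),\ S\big(\eta_\theta(\mathsf{cushion})(y),\ \eta_\theta(\mathsf{armRest})(y)\big)\Big).$$
Next, we choose the operators as $T = T_P$, $S = S_P$, $A = A_{T_P}$ and $I = I_{RC}$. Writing $a_{x,y} = \eta_\theta(\mathsf{chair})(x) \cdot \eta_\theta(\mathsf{partOf})(y, x)$ and $c_y = \eta_\theta(\mathsf{cushion})(y) + \eta_\theta(\mathsf{armRest})(y) - \eta_\theta(\mathsf{cushion})(y) \cdot \eta_\theta(\mathsf{armRest})(y)$, the computation of the valuation function can then be written as
$$e_\theta(\varphi) = \prod_{x, y \in \{o_1, o_2\}} \big(1 - a_{x,y} + a_{x,y} \cdot c_y\big).$$
If we interpret the predicate functions using a lookup in the table of probabilities from Example 1, so that $\eta_{\theta_n}(\mathsf{P})(x) = p(\mathsf{P}(x) \mid I, x)$, we find that $e_{\theta_n}(\varphi) = 0.612$. Taking $K = \{\varphi\}$, we can find the partial derivatives of the loss with respect to each ground atom using repeated applications of the chain rule. We can now do a gradient update step to update the probabilities in the table of probabilities from Example 1, or use Equation 15 to find the partial derivatives with respect to the parameters $\theta_n$ of some deep learning model $p_{\theta_n}$. One particularly interesting property of Differentiable Fuzzy Logics is that the partial derivatives of the subformulas with respect to the satisfaction of the knowledge base have a somewhat explainable meaning. For example, as hypothesized in Example 1, the computed gradients reflect that we should increase $p(\mathsf{cushion}(o_2) \mid I, o_2)$, as it is indeed the (absolute) largest partial derivative.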
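The optimization loop described above can be sketched on a toy problem. Everything specific in this sketch is an assumption made for illustration: the knowledge base contains a single rule, "all ravens are black", the truth values of `raven` are fixed made-up numbers, the truth values of `black` are sigmoids of trainable logits, and a central finite difference stands in for reverse-mode differentiation so that the example needs no autodiff library.

```python
import math

# Hypothetical, fixed truth values of raven(o) for three objects.
RAVEN = [0.9, 0.8, 0.1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dfl_loss(theta):
    """Negative valuation of 'forall x: raven(x) -> black(x)' (weight 1).

    Uses the Reichenbach implication and the product aggregator; theta holds
    one logit per object for the predicate black.
    """
    satisfaction = 1.0
    for raven, logit in zip(RAVEN, theta):
        black = sigmoid(logit)
        satisfaction *= 1.0 - raven + raven * black
    return -satisfaction


def numeric_gradient(f, theta, h=1e-6):
    """Central finite differences, a stand-in for backpropagation."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def gradient_descent(theta, learning_rate=0.5, steps=200):
    """Repeat the update theta <- theta - learning_rate * gradient."""
    for _ in range(steps):
        grad = numeric_gradient(dfl_loss, theta)
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta


theta_0 = [0.0, 0.0, 0.0]
theta_n = gradient_descent(theta_0)
```

After training, every logit has increased and the loss is lower than at initialization: the only way to better satisfy the rule with fixed antecedents is to raise the truth value of the consequent. The object with the small `raven` value receives a much weaker update, a first hint at how the choice of operators shapes the learning signal.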

Derivatives of Operators
We will now show that the choice of operators that are used for the logical connectives actually determines the inferences that are done when using DFL. If we used a different set of operators in Example 2, we would have gotten very different derivatives. These could in some cases make more sense, and in some other cases less. Furthermore, it is much easier to find a global minimum of the fuzzy maximum satisfiability problem (Equation 14) for some operators than for others. This is often because of the smoothness of the operators. In this section, we analyze a wide variety of functions that can be used for logical reasoning and present some of their properties that determine how useful they are in inferences such as those illustrated above.
We will not discuss any varieties of fuzzy negations since the classical negation $N_C(a) = 1 - a$ is already continuous, intuitive and has simple derivatives.
Definition 15. A function $f : \mathbb{R} \to \mathbb{R}$ is said to be nonvanishing if $f(a) \neq 0$ for all $a \in \mathbb{R}$, i.e. it is nonzero everywhere. A function $f : \mathbb{R}^n \to \mathbb{R}$ has a nonvanishing derivative if for all $a_1, \dots, a_n \in \mathbb{R}$ there is some $1 \leq i \leq n$ such that $\frac{\partial f(a_1, \dots, a_n)}{\partial a_i} \neq 0$.
Whenever the derivative of an operator vanishes, it loses its learning signal. Notice that derivatives of composites of functions with nonvanishing derivatives can still be vanishing. For instance, using the product t-conorm for $a \vee \neg a$, we find that the derivative of $S_P(a, 1 - a)$ is 0 at $a = \frac{1}{2}$. Furthermore, all the partial derivatives of the connectives used in the backward pass from the valuation function to the ground atoms have to be multiplied. If the partial derivatives are less than 1, their product will also approach 0.
The drastic product $T_D$ and operators derived from it, such as the drastic sum $S_D$ and the Dubois-Prade and Weber implications ($I_{DP}$ and $I_{WB}$), have vanishing derivatives almost everywhere. In deep learning models, output probabilities are the result of transformations on real numbers using functions like the sigmoid or softmax that result in truth values in $(0, 1)$. The operators derived from $T_D$ only have nonvanishing derivatives when their inputs are exactly 0 or 1, making them not very useful for this application.
Definition 16. A function $f : \mathbb{R}^n \to \mathbb{R}$ is said to be single-passing if for all $x_1, \dots, x_n \in [0, 1]$ it holds that
$$\left|\left\{ i \;\middle|\; \frac{\partial f(x_1, \dots, x_n)}{\partial x_i} \neq 0 \right\}\right| \leq 1.$$
A single-passing function has nonzero derivatives on at most one input argument. Using just single-passing fuzzy logic operators can be inefficient, since then at most one input will have a nonzero derivative (i.e. a learning signal), yet the complete forward pass still has to be computed to find this input.
Proposition 1. Any composition of single-passing functions is also single-passing.
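The vanishing derivative of $a \vee \neg a$ under the product t-conorm can be verified directly. Using the standard closed form $S_P(a, b) = a + b - ab$ and the classical negation, a short worked derivation reads:

```latex
\begin{align*}
S_P(a, 1 - a) &= a + (1 - a) - a(1 - a) = 1 - a + a^2, \\
\frac{d}{da}\, S_P(a, 1 - a) &= 2a - 1, \\
2a - 1 = 0 \quad &\Longleftrightarrow \quad a = \tfrac{1}{2}.
\end{align*}
```

At $a = \frac{1}{2}$ the composite attains its minimum value $\frac{3}{4}$, so gradient-based optimization receives no signal exactly where the formula is least satisfied.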
For the proof, see Appendix B.1. Concluding, for any logical operator to be usable in the learning task, it will need to have a nonvanishing derivative at the majority of the input signals (so that it can contribute to the learning signal at all), and ideally to not be single-passing (so that it can contribute efficiently to the learning signal).

Aggregation
After the global considerations from the previous section, we next analyze in detail each aggregation operator separately and outline their benefits and disadvantages. As explained above, we limit ourselves to aggregation functions for universal quantification only.

Minimum Aggregator
The minimum aggregator is given as $A_{T_G}(x_1, \dots, x_n) = \min(x_1, \dots, x_n)$. The partial derivatives are
$$\frac{\partial A_{T_G}(x_1, \dots, x_n)}{\partial x_i} = \begin{cases} 1 & \text{if } i = \arg\min_j x_j, \\ 0 & \text{otherwise.} \end{cases}$$
It is single-passing, with the only nonzero gradient being on the input with the lowest truth value. Many practical formulas have exceptions. An exception to a formula like $\forall x\ \mathsf{Raven}(x) \rightarrow \mathsf{Black}(x)$ would be a raven over which a bucket of red paint is thrown. When 'red raven' is correctly predicted, the minimum aggregator would have its only nonzero derivative on that exception. Additionally, it is inefficient, as we still have to compute the forward pass for inputs that do not get a feedback signal.
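The single-passing behaviour of the minimum aggregator is easy to see in code. The sketch below writes out its gradient by hand; breaking ties by taking the first minimal input is an assumption of this sketch (any choice among tied inputs gives a valid subgradient).

```python
def min_aggregator(xs):
    return min(xs)


def min_aggregator_grad(xs):
    """Gradient of min(xs): 1 on the smallest input, 0 everywhere else.

    Ties are broken by picking the first minimal input.
    """
    i_min = min(range(len(xs)), key=lambda i: xs[i])
    return [1.0 if i == i_min else 0.0 for i in range(len(xs))]


truth_values = [0.9, 0.2, 0.7]
grad = min_aggregator_grad(truth_values)
```

Only the input with truth value 0.2 receives a learning signal; the other two inputs were evaluated in the forward pass but contribute nothing to the update.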

Lukasiewicz Aggregator
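The Lukasiewicz aggregator is given as
$$A_{T_{LU}}(x_1, \dots, x_n) = \max\Big(\sum_{i=1}^{n} x_i - (n - 1),\ 0\Big).$$
The gradient is nonvanishing only when $\sum_{i=1}^{n} x_i > n - 1$, that is, when the average value of $x_i$ is larger than $\frac{n-1}{n}$ (Páll Jónsson, 2018). As $\lim_{n \to \infty} \frac{n-1}{n} = 1$, for larger values of $n$, all inputs have to be close to completely true for this condition to hold.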
For the next proposition, we refer to the fraction of inputs for which some condition holds. The probability that the condition holds for a point uniformly sampled from [0, 1] n is this fraction.
Proposition 2. The fraction of inputs $x_1, \dots, x_n \in [0, 1]$ for which the derivative of $A_{T_{LU}}$ is nonvanishing is $\frac{1}{n!}$.
For proof, see Appendix B.2.1. Clearly, for the majority of inputs there is a vanishing gradient, implying that this aggregator would not be useful in a DFL learning setting.
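The $\frac{1}{n!}$ fraction can be sanity-checked with a Monte Carlo estimate: sample points uniformly from $[0, 1]^n$ and count how often $\sum_i x_i > n - 1$. The function name, sample size and seed below are arbitrary choices made for this sketch.

```python
import math
import random


def lukasiewicz_nonvanishing_fraction(n, samples=200_000, seed=0):
    """Estimate the fraction of [0, 1]^n where sum(x) > n - 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        if sum(rng.random() for _ in range(n)) > n - 1:
            hits += 1
    return hits / samples


estimate_2 = lukasiewicz_nonvanishing_fraction(2)
estimate_3 = lukasiewicz_nonvanishing_fraction(3)
exact_2 = 1 / math.factorial(2)
exact_3 = 1 / math.factorial(3)
```

The estimates land close to 1/2 and 1/6. Already for n = 10 the exact fraction is about 2.8e-7, so a uniformly random batch of ten truth values would almost never produce a gradient.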

Yager Aggregator
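The Yager aggregator is given by
$$A_{T_Y}(x_1, \dots, x_n) = \max\bigg(1 - \Big(\sum_{i=1}^{n} (1 - x_i)^p\Big)^{\frac{1}{p}},\ 0\bigg).$$
When $p = 1$ this corresponds to the Lukasiewicz aggregator, $p = \infty$ corresponds to the minimum aggregator and $p = 0$ corresponds to the aggregator formed from the drastic product, $A_{T_D}$. The derivative of the Yager aggregator is
$$\frac{\partial A_{T_Y}(x_1, \dots, x_n)}{\partial x_i} = \begin{cases} \Big(\sum_{j=1}^{n} (1 - x_j)^p\Big)^{\frac{1}{p} - 1} \cdot (1 - x_i)^{p-1} & \text{if } \sum_{j=1}^{n} (1 - x_j)^p < 1, \\ 0 & \text{otherwise.} \end{cases}$$
This derivative vanishes whenever $\sum_{i=1}^{n} (1 - x_i)^p \geq 1$. As $1 - x_i \in [0, 1]$, each term $(1 - x_i)^p$ decreases as $p$ grows. Therefore, $\sum_{i=1}^{n} (1 - x_i)^p < 1$ holds for a larger fraction of inputs when $p$ increases, with the fraction being 0 for $p = 0$, as it corresponds to the drastic aggregator, and 1 for $p = \infty$.
The exact fraction of inputs with a nonvanishing derivative is hard to express.⁵ However, we can find a closed-form expression for the Euclidean case $p = 2$.
Proposition 3. The fraction of inputs $x_1, \dots, x_n \in [0, 1]$ for which the derivative of $A_{T_Y}$ with $p = 2$ is nonvanishing is $\frac{\pi^{n/2}}{2^n\, \Gamma\left(\frac{n}{2} + 1\right)}$, where $\Gamma$ is the Gamma function.
See Appendix B.2.2 for proof. We plot the fraction of nonvanishing derivatives for several values of $p$ in Figure 2. For fairly small $p$, the vast majority of the inputs will have a vanishing derivative, and similarly for large $n$, showing that this aggregator is also of little use in a learning context.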
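For $p = 2$ the nonvanishing region $\sum_i (1 - x_i)^2 < 1$ is the part of the unit ball that lies in one orthant, which gives the closed form $\pi^{n/2} / (2^n\, \Gamma(n/2 + 1))$. The sketch below evaluates this expression and compares it with a Monte Carlo estimate that also works for other values of $p$; function names, sample size and seed are assumptions of the sketch.

```python
import math
import random


def yager_fraction_p2(n):
    """Closed form for p = 2: volume of the unit n-ball divided by 2^n."""
    return math.pi ** (n / 2) / (2 ** n * math.gamma(n / 2 + 1))


def yager_fraction_mc(n, p, samples=200_000, seed=0):
    """Estimate the fraction of [0, 1]^n where sum((1 - x)^p) < 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        if sum((1.0 - rng.random()) ** p for _ in range(n)) < 1.0:
            hits += 1
    return hits / samples


closed_form = yager_fraction_p2(3)   # pi / 6
estimate = yager_fraction_mc(3, 2)
fractions_by_p = [yager_fraction_mc(3, p) for p in (1, 2, 8)]
```

For n = 3 the closed form is π/6 ≈ 0.524, and the estimates for p = 1, 2, 8 increase with p, in line with the observation that larger p leaves a larger part of the input space with a learning signal.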

Mean-p Error Aggregator
If we are concerned only with maximizing the truth value of $A_{T_Y}$, we can simply remove the max constraint, resulting in an 'unbounded Yager' aggregator that has a nonvanishing derivative everywhere. However, then the co-domain of the function is no longer $[0, 1]$. We can do a linear transformation on this function to ensure this is the case (Appendix C.1).
Definition 17. For some $p \geq 0$, the Mean-p Error aggregator $A_{pME}$ is defined as
$$A_{pME}(x_1, \dots, x_n) = 1 - \bigg(\frac{1}{n} \sum_{i=1}^{n} (1 - x_i)^p\bigg)^{\frac{1}{p}}.$$
The 'error' here is the difference between the predicted value $x_i$ and the 'ground truth' value, 1. This function has the following derivative:
$$\frac{\partial A_{pME}(x_1, \dots, x_n)}{\partial x_i} = (1 - x_i)^{p-1}\, \frac{1}{n^{1/p}} \bigg(\sum_{j=1}^{n} (1 - x_j)^p\bigg)^{\frac{1}{p} - 1}.$$
When $p > 1$, this derivative is largest for the inputs that are smallest, which can speed up the optimization by being sensitive to outliers. For $p < 1$, the opposite is true. A special case is $p = 1$:
$$A_{MAE}(x_1, \dots, x_n) = 1 - \frac{1}{n} \sum_{i=1}^{n} (1 - x_i),$$
having the simple derivative $\frac{\partial A_{MAE}(x_1, \dots, x_n)}{\partial x_i} = \frac{1}{n}$. This measure is equal to 1 minus the mean absolute error (MAE) and is associated with the Lukasiewicz t-norm. Another special case is $p = 2$:
$$1 - \sqrt{\frac{1}{n} \sum_{i=1}^{n} (1 - x_i)^2}.$$
This function is equal to 1 minus the root-mean-square error (RMSE), which is commonly used for regression tasks and heavily weights outliers.
We can do the same for the Yager t-conorm $\min\big((a^p + b^p)^{1/p}, 1\big)$ (Appendix C.1):
Definition 18. For some $p \geq 0$, the p-Mean aggregator is defined as
$$A_{pM}(x_1, \dots, x_n) = \bigg(\frac{1}{n} \sum_{i=1}^{n} x_i^p\bigg)^{\frac{1}{p}}.$$
The case of $p = 1$ corresponds to the arithmetic mean and $p = 2$ to the quadratic mean (root mean square). In contrast to the Mean-p Error, its derivative $\frac{\partial A_{pM}(x_1, \dots, x_n)}{\partial x_i} = x_i^{p-1}\, \frac{1}{n^{1/p}} \big(\sum_{j=1}^{n} x_j^p\big)^{\frac{1}{p} - 1}$ has greater values for smaller inputs when $p < 1$, and lower values when $p > 1$. Note that the arithmetic mean $A_{1M}$ has the same derivative as the mean absolute error $A_{MAE}$. Unlike $A_{S_Y}$, the only maximum of this aggregator is $x_1, \dots, x_n = 1$. Therefore, it is not a sensible choice for generalizing the ∃ quantifier, and the Mean-p Error outperforms it for generalizing ∀.
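To make the effect of $p$ on the gradient tangible, the sketch below implements the Mean-p Error aggregator $1 - \big(\frac{1}{n}\sum_i (1 - x_i)^p\big)^{1/p}$ together with its analytic derivative. The function names and the example inputs are assumptions of this sketch; for $p > 1$ the gradient formula is undefined when every input equals 1, which the code does not guard against.

```python
def mean_p_error(xs, p):
    """Mean-p Error aggregator: 1 - (mean of (1 - x)^p)^(1/p)."""
    n = len(xs)
    return 1.0 - (sum((1.0 - x) ** p for x in xs) / n) ** (1.0 / p)


def mean_p_error_grad(xs, p):
    """Analytic partial derivatives of the Mean-p Error aggregator.

    Assumes that not all inputs equal 1 when p > 1 (division by zero).
    """
    n = len(xs)
    total = sum((1.0 - x) ** p for x in xs)
    scale = n ** (-1.0 / p) * total ** (1.0 / p - 1.0)
    return [scale * (1.0 - x) ** (p - 1.0) for x in xs]


xs = [0.9, 0.5, 0.1]
grad_p1 = mean_p_error_grad(xs, 1)  # uniform: 1/n for every input
grad_p2 = mean_p_error_grad(xs, 2)  # largest for the smallest input
```

With `p = 1` every input receives the same gradient 1/3, while with `p = 2` the input 0.1 receives roughly nine times the gradient of the input 0.9, which is the outlier sensitivity described above.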

Product Aggregator
The product aggregator is given as $A_{T_P}(x_1, \ldots, x_n) = \prod_{i=1}^{n} x_i$. This is also the probability of the intersection of n independent events. It has the following partial derivatives:
$$\frac{\partial A_{T_P}(x_1, \ldots, x_n)}{\partial x_i} = \prod_{j=1, j \neq i}^{n} x_j. \tag{25}$$
The gradient is nonvanishing only if fewer than two inputs $x_i$ are equal to 0. Furthermore, the derivative with respect to $x_i$ decreases if some other input $x_j$ is low, despite the two being independent. Finally, we cannot compute this aggregator in practice due to numerical underflow when multiplying many small numbers. Noting that argmax f(x) = argmax log(f(x)), we observe that the log-product aggregator
$$A_{\log T_P}(x_1, \ldots, x_n) = (\log \circ A_{T_P})(x_1, \ldots, x_n) = \sum_{i=1}^{n} \log(x_i)$$
can be used for formulas in prenex normal form, as then the truth value of the universal quantifiers is not used for another connective. Unlike the other aggregators, its codomain is the non-positive numbers instead of [0, 1]. Furthermore, the log-product aggregator can be seen as the log-likelihood function where we take the correct label to be 1, and thus this is similar to cross-entropy minimization. The partial derivatives are
$$\frac{\partial A_{\log T_P}(x_1, \ldots, x_n)}{\partial x_i} = \frac{1}{x_i}.$$
In contrast to Equation 25, the values of the other inputs are irrelevant, and derivatives with respect to lower-valued inputs will be far greater, as there is a singularity at x = 0 (i.e. the value becomes infinite). We can therefore conclude that the product aggregator, in its logarithmic form, is particularly promising as its derivative is nonvanishing almost everywhere and it can handle outliers.
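The underflow problem and the log-domain remedy are easy to demonstrate. The sketch below uses only the standard library; the function names and the choice of 200 inputs with truth value 0.01 are illustrative assumptions, not values from the experiments.

```python
import math


def product_aggregator(xs):
    """Plain product of truth values; underflows for many small inputs."""
    out = 1.0
    for x in xs:
        out *= x
    return out


def log_product_aggregator(xs):
    """Sum of logs: the same maximizer as the product, computed stably."""
    return sum(math.log(x) for x in xs)


def log_product_grad(xs):
    """Partial derivative 1 / x_i, independent of the other inputs."""
    return [1.0 / x for x in xs]


xs = [0.01] * 200
underflowed = product_aggregator(xs)   # 1e-400 is not representable as a float64
stable = log_product_aggregator(xs)    # 200 * log(0.01), a finite negative number
```

Here the product collapses to exactly zero, so every derivative of the plain product would vanish as well, while the log-domain value and its gradient stay informative.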

Nilpotent Aggregator
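The Nilpotent t-norm is given by
$$T_{nM}(a, b) = \begin{cases} \min(a, b), & \text{if } a + b > 1 \\ 0, & \text{otherwise.} \end{cases}$$
The derivative of the corresponding aggregator $A_{T_{nM}}$ is found as follows:
$$\frac{\partial A_{T_{nM}}(x_1, \ldots, x_n)}{\partial x_i} = \begin{cases} 1, & \text{if } i = \arg\min_j x_j \text{ and } x_i + x_k > 1 \\ 0, & \text{otherwise,} \end{cases} \tag{29}$$
where $x_k$ is the second-lowest input. Like the minimum aggregator it is single-passing, and like the Lukasiewicz aggregator it has a derivative that vanishes for the majority of the input space, as it vanishes when the sum of the two smallest values is lower than 1.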
Proposition 4. The fraction of inputs $x_1, \ldots, x_n \in [0, 1]$ for which the derivative of $A_{T_{nM}}$ is nonvanishing is $\frac{1}{2^{n-1}}$.
For proof, see Appendix B.2.3. The fraction of inputs for which there is a nonvanishing derivative is plotted in Figure 2. Again, this means that for larger numbers of inputs n, the derivative of this aggregator will vanish on almost every input, and it is not a useful construction in a learning context.
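The vanishing-derivative condition for the nilpotent aggregator (the two smallest inputs sum to at most 1) can be checked by simulation. The sketch below is a Monte Carlo estimate with uniformly sampled inputs; the function names, sample size and seed are our own choices. Integrating the condition over the unit cube gives 1/2^(n-1) for the nonvanishing fraction, which the estimates can be compared against.

```python
import random


def nilpotent_grad_nonvanishing(xs):
    """True when the two smallest inputs sum to more than 1."""
    lowest, second = sorted(xs)[:2]
    return lowest + second > 1.0


def estimate_fraction(n, samples=100_000, seed=0):
    """Monte Carlo estimate of the nonvanishing fraction for n uniform inputs."""
    rng = random.Random(seed)
    hits = sum(
        nilpotent_grad_nonvanishing([rng.random() for _ in range(n)])
        for _ in range(samples)
    )
    return hits / samples


estimates = {n: estimate_fraction(n) for n in (2, 3, 5)}
expected = {n: 1.0 / 2 ** (n - 1) for n in (2, 3, 5)}
```

Already for five inputs only about 6% of uniformly sampled points receive any gradient, which illustrates why this aggregator does not scale.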

Summary
The minimum aggregator is computationally inefficient and cannot handle exceptions well. Aggregation operators whose derivatives vanish when receiving a large number of inputs will not scale well, and these include operators based on the Yager family of t-norms and the nilpotent aggregator. Removing the bounds from the Yager aggregators introduces interesting connections to loss functions from the classical machine learning literature. This is also the case for the logarithmic version of the product aggregator, which corresponds to the cross-entropy loss function. These aggregators have natural means for dealing with outliers, and thus are promising for practical use.

Conjunction and Disjunction
Next, we analyze the partial derivatives of t-norms and t-conorms, which are used as conjunction and disjunction in Fuzzy Logics. In t-norm Fuzzy Logics, the weak disjunction max(a, b), or the Gödel t-conorm, is used instead of the dual t-conorm.
Suppose that we have a t-norm T and a t-conorm S. We define the following two quantities, where the choice of taking the partial derivative with respect to a is without loss of generality, since T and S are commutative by definition:
$$d_T(a, b) = \frac{\partial T(a, b)}{\partial a}, \qquad d_S(a, b) = \frac{\partial S(a, b)}{\partial a}.$$
It should be noted that by Definition 3, $d_T(a, 1) = 1$ as T(a, 1) = a for any t-norm T, and by Definition 5, $d_S(a, 0) = 1$ as S(a, 0) = a for any t-conorm S. Furthermore, we note that if S is the $N_C$-dual t-conorm of T, that is, S(a, b) = 1 − T(1 − a, 1 − b), then $d_T(a, b) = d_S(1 − a, 1 − b)$.
The main difference in analyzing t-norms and t-conorms is that the only maximum of T(a, b) (namely 1) is attained when both arguments a and b are 1. In contrast, a t-conorm has an infinite number of maxima, since S(a, 1) = 1 for every a. Some of these maxima might be more desirable than others. Referring back to the formula in Example 1, we showed that it is preferable to increase the truth value of cushion(y) and not of armRest(y). Similarly, when a conjunction is negated, or when it appears in the antecedent of an implication (like in the aforementioned formula), we have to choose which of the two conjuncts to decrease. By noting that $d_T(a, b) = d_S(1 − a, 1 − b)$, we find that the t-norm "chooses" in the same way its dual t-conorm would "choose". Similarly, if a disjunction is negated, it will minimize both its arguments in the way that its dual t-norm would maximize its arguments.
Example 3. We introduce a running example to analyze the behavior of different t-norms. Let us optimize (a ∧ b) ∨ (c ∧ ¬a) using gradient descent. The truth value of this expression is computed using f(a, b, c) = S(T(a, b), T(1 − a, c)). Using the boundary conditions from Definitions 3 and 5, we find the global optima a = 1.0, b = 1.0 and a = 0.0, c = 1.0. The derivative of this function with respect to a can be computed using the multivariable chain rule:
$$\frac{\partial f(a, b, c)}{\partial a} = d_S(T(a, b), T(1 - a, c)) \cdot d_T(a, b) - d_S(T(1 - a, c), T(a, b)) \cdot d_T(1 - a, c). \tag{32}$$
Figure 3: Decision tree for the derivative of S(T(a, b), T(1 − a, c)) with respect to a when using the Gödel t-norm and t-conorm.
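The chain-rule derivative of the running example f(a, b, c) = S(T(a, b), T(1 − a, c)) can be written once for any pair of operators and then checked against a finite difference. The sketch below does so for the product t-norm and t-conorm; all function names and the evaluation point are illustrative assumptions.

```python
def f(a, b, c, T, S):
    """Truth value of (a AND b) OR (c AND NOT a)."""
    return S(T(a, b), T(1.0 - a, c))


def df_da(a, b, c, T, d_T, d_S):
    """Chain rule; the minus sign comes from the inner derivative of 1 - a."""
    u, v = T(a, b), T(1.0 - a, c)
    return d_S(u, v) * d_T(a, b) - d_S(v, u) * d_T(1.0 - a, c)


# Product t-norm and t-conorm with their partial derivatives in the first argument.
def T_P(a, b):
    return a * b


def S_P(a, b):
    return a + b - a * b


def d_TP(a, b):
    return b


def d_SP(a, b):
    return 1.0 - b


a, b, c = 0.3, 0.8, 0.6                      # hypothetical truth values
analytic = df_da(a, b, c, T_P, d_TP, d_SP)
eps = 1e-6
numeric = (f(a + eps, b, c, T_P, S_P) - f(a - eps, b, c, T_P, S_P)) / (2 * eps)
```

Swapping in the derivative functions of another t-norm and its dual t-conorm reuses the same `df_da`, which is convenient when comparing operator families.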

Gödel T-Norm
The Gödel t-norm is $T_G(a, b) = \min(a, b)$ and the Gödel t-conorm is $S_G(a, b) = \max(a, b)$, with partial derivatives
$$d_{T_G}(a, b) = 1_{a < b}, \qquad d_{S_G}(a, b) = 1_{a > b},$$
where the indicator function $1_c$ returns 1 if the condition c is true, and 0 otherwise. Both $T_G$ and $S_G$ are single-passing, although one of the two partial derivatives is nonvanishing whenever a ≠ b. A benefit of the magnitude of the derivative nearly always being 1 is that there will not be any exploding or vanishing gradients caused by multiple repeated applications of the chain rule.
Example 4. Using the Gödel t-norm and t-conorm in Equation 32 gives
$$\frac{\partial f(a, b, c)}{\partial a} = 1_{\min(a, b) > \min(1 - a, c)} \cdot 1_{a < b} - 1_{\min(1 - a, c) > \min(a, b)} \cdot 1_{1 - a < c}.$$
This corresponds to the decision tree in Figure 3. The value of a can be modified to increase the truth of either one of the conjunctions. In order to choose which of the two should be true, it compares a with 1 − a.
Gradient ascent always finds a global optimum for this formula. However, a small perturbation in the truth values of the inputs can flip the derivative around. For instance, if a ≤ b and 1 − a ≤ c, then it will increase a if its value is 0.501 and decrease it if it is 0.499. Furthermore, this can cause gradient ascent to get stuck in local optima even for simple problems. For instance, if ϕ = (a ∨ b) ∧ (¬a ∨ c) and a = 0.4, b = 0.2 and c = 0.1, gradient ascent increases a until a > 0.5, at which point the gradient flips and it decreases a until a < 0.5. Experiments with a simple gradient ascent algorithm have shown us that for this formula the algorithm finds a global optimum in only 88.8% of random initializations of a, b and c.
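The oscillation described for ϕ = (a ∨ b) ∧ (¬a ∨ c) can be reproduced with a few lines of subgradient ascent. This is a minimal sketch, not the experimental code: the learning rate, step count and tie-breaking rule are our own assumptions. On this trajectory b and c never receive a gradient, so only a is updated.

```python
def phi(a, b, c):
    """Goedel valuation of (a OR b) AND (NOT a OR c)."""
    return min(max(a, b), max(1.0 - a, c))


def godel_grad_a(a, b, c):
    """Subgradient of phi with respect to a (ties go to the second clause)."""
    left, right = max(a, b), max(1.0 - a, c)
    if left < right:                          # first clause is the minimum
        return 1.0 if a > b else 0.0
    return -1.0 if 1.0 - a > c else 0.0       # second clause is the minimum


def ascend(a, b, c, lr=0.01, steps=1000):
    """Gradient ascent on a alone, clipped to [0, 1]."""
    for _ in range(steps):
        a = min(1.0, max(0.0, a + lr * godel_grad_a(a, b, c)))
    return a


a_final = ascend(0.4, 0.2, 0.1)
value = phi(a_final, 0.2, 0.1)   # stays near 0.5, far from the optimum of 1
```

The iterate hovers around a = 0.5 indefinitely, because each clause takes over as the minimum as soon as the other one has been improved.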

Lukasiewicz T-Norm
The Lukasiewicz t-norm is $T_{LK}(a, b) = \max(a + b − 1, 0)$ and the Lukasiewicz t-conorm is $S_{LK}(a, b) = \min(a + b, 1)$. The partial derivatives are:
$$d_{T_{LK}}(a, b) = 1_{a + b > 1}, \qquad d_{S_{LK}}(a, b) = 1_{a + b < 1}.$$
These derivatives vanish on as much as half of their domain (Proposition 2). However, like the Gödel t-norm, when there is a gradient, it is large and it will not cause vanishing or exploding gradients.
Example 5. Using the Lukasiewicz t-norm and t-conorm in Equation 32 gives rise to the following computation:
$$\frac{\partial f(a, b, c)}{\partial a} = 1_{\max(a + b - 1, 0) + \max(c - a, 0) < 1} \cdot \left(1_{a + b > 1} - 1_{c > a}\right).$$
Choosing random values to initialize a, b and c, gradient descent is able to find a (global) optimum in about 83.5% of the initializations.

Yager T-Norm
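The family of Yager t-norms (Yager, 1980) is $T_Y(a, b) = \max(1 − ((1 − a)^p + (1 − b)^p)^{1/p}, 0)$ and the family of Yager t-conorms is $S_Y(a, b) = \min((a^p + b^p)^{1/p}, 1)$ for p ≥ 1. We plot these for p = 2 in Figure 4. The derivatives are given by
$$d_{T_Y}(a, b) = \begin{cases} \left((1 - a)^p + (1 - b)^p\right)^{\frac{1}{p} - 1} \cdot (1 - a)^{p-1}, & \text{if } (1 - a)^p + (1 - b)^p < 1 \\ 0, & \text{otherwise,} \end{cases}$$
$$d_{S_Y}(a, b) = \begin{cases} \left(a^p + b^p\right)^{\frac{1}{p} - 1} \cdot a^{p-1}, & \text{if } a^p + b^p < 1 \\ 0, & \text{otherwise.} \end{cases}$$
We plot these derivatives in Figure 5, showing for each a vanishing derivative on a non-negligible section of the domain. Using the method described in footnote 5 (Section 6.3), Mathematica finds a closed form expression for the fraction of inputs for which the derivative of the Yager t-norm is nonvanishing as $\frac{\Gamma(1 + 1/p)^2}{\Gamma(1 + 2/p)}$. Observe that when p > 1, the derivative of $T_Y$ is undefined at a = b = 1 and the derivative of $S_Y$ is undefined at a = b = 0. For p > 1, the lower of the two truth values has a higher derivative for the t-norm, while for the t-conorm, the higher of the two truth values has a higher derivative. As p increases, $T_Y$ and $S_Y$ will behave more like $T_G$ and $S_G$. Note that when p < 1, the t-norm will have higher derivatives for higher inputs, as the derivative has a singularity: $\lim_{a \to 1} d_{T_Y}(a, b) = \infty$ for b < 1.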
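The size of the region on which the Yager t-norm has a nonvanishing derivative can be verified numerically. That region, (1 − a)^p + (1 − b)^p < 1, is a quarter of a superellipse, whose area is known in closed form as Γ(1 + 1/p)² / Γ(1 + 2/p). The sketch below compares this with a midpoint-grid estimate; the function names and grid resolution are our own choices.

```python
import math


def yager_tnorm_grad_nonvanishing(a, b, p):
    """True where T_Y(a, b) is not clipped to 0 by the max."""
    return (1.0 - a) ** p + (1.0 - b) ** p < 1.0


def grid_fraction(p, m=300):
    """Fraction of an m-by-m midpoint grid on [0, 1]^2 with a nonvanishing derivative."""
    pts = [(i + 0.5) / m for i in range(m)]
    hits = sum(yager_tnorm_grad_nonvanishing(a, b, p) for a in pts for b in pts)
    return hits / (m * m)


def closed_form(p):
    """Area of the quarter superellipse x^p + y^p < 1 inside the unit square."""
    return math.gamma(1.0 + 1.0 / p) ** 2 / math.gamma(1.0 + 2.0 / p)


fractions = {p: (grid_fraction(p), closed_form(p)) for p in (1.0, 2.0, 5.0)}
```

For p = 1 the closed form gives 1/2 (the Lukasiewicz case) and for p = 2 it gives π/4, so the nonvanishing region grows with p for two inputs.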

Product T-Norm
The product t-norm and t-conorm, visualized in Figure 6, are
$$T_P(a, b) = a \cdot b, \qquad S_P(a, b) = a + b - a \cdot b,$$
with partial derivatives $d_{T_P}(a, b) = b$ and $d_{S_P}(a, b) = 1 − b$. The gradient of the t-norm only vanishes when a = b = 0 and similarly the gradient of the t-conorm only vanishes when a = b = 1. The derivative of the t-norm can be interpreted as follows: 'If we wish to increase a ∧ b, a should be increased in proportion to b.' This is not a sensible learning strategy: if both a and b are small, in which case the conjunction is most certainly not satisfied, the derivative will be low instead of high. The derivative of the t-conorm is more intuitive, as it says 'If we wish to increase a ∨ b, a should be increased in proportion to 1 − b'. If b is not yet true, we definitely want at least a to be true.
Example 6. By using the product t-norm and t-conorm in Equation 32, we get
$$\frac{\partial f(a, b, c)}{\partial a} = (1 - (1 - a) \cdot c) \cdot b - (1 - a \cdot b) \cdot c.$$
As explained, this increases a in proportion to b if it is not true that c and ¬a are true, and decreases a in proportion to c if it is not true that a and b are true.

Summary
The Gödel t-norm and t-conorm are simple and effective, having strong derivatives almost everywhere. However, they can be quite brittle because they make very binary choices. The Lukasiewicz t-norm and t-conorm also have strong derivatives, but these vanish on half of the domain. The derivatives of the Yager family of t-norms and t-conorms also vanish on a significant part of their domain. The derivative of the Yager t-norm is larger for lower values, which is a sensible learning strategy. This is not the case for the product t-norm, where the derivative is dependent on the other input value. However, the derivative of the product t-conorm is intuitive, and corresponds to the intuition that if one input is not true, the other one should be.

Implication
Finally, we consider what functions are suitable for modelling the implication. We will start by discussing the particular challenges associated with the implication operator.

Challenges of the Material Implication
A significant proportion of background knowledge is written as universally quantified implications. Examples of such statements are 'all humans are mortal', 'laptops consist of a screen, a processor and a keyboard' and 'only humans wear clothes'. These formulas are of the form ∀x φ(x) → ψ(x), where we call φ(x) the antecedent and ψ(x) the consequent.
The implication is used in two well known rules of inference from classical logic. Modus ponens inference says that if ∀x φ(x) → ψ(x) and we know that φ(x) is true, then ψ(x) should also be true. Modus tollens inference says that if ∀x φ(x) → ψ(x) and we know that ψ(x) is false, then φ(x) should also be false, as if φ(x) were true, ψ(x) should also have been.
Unlike sequences of conjunctions where each of the formulas should simply be true, when the learning agent predicts a scene in which an implication is false, the supervisor has multiple choices to correct it. Consider the implication 'all ravens are black'. There are 4 categories for this formula: black ravens (BR), non-black non-ravens (NBNR), black non-ravens (BNR) and non-black ravens (NBR). Assume our agent observes an NBR, which is inconsistent with the background knowledge. There are then four options to consider.

1. Modus Ponens (MP): The antecedent is true, so by modus ponens, the consequent is also true. That is, we trust the agent's observation of a raven and believe it to be a BR.
2. Modus Tollens (MT): The consequent is false, so by modus tollens, the antecedent is also false. That is, we trust the agent's observation of a non-black object and believe that it was not a raven (NBNR).
3. Distrust: We believe the agent is wrong (both about observing a raven and about observing a non-black object) and it is probably a black object which is not a raven (BNR).
4. Exception: We trust the agent and ignore the fact that its observation goes against the background knowledge that ravens are black. Hence, it has to be a non-black raven (NBR).
The distrust option seems somewhat useless, and the exception option is often going to be correct, but we cannot know when this is just from the agent's observations alone. In such cases, DFL would not be very useful since it would not teach the agent anything new. We can safely assume that there are far more non-black objects which are not ravens than there are ravens. We can argue that from a statistical perspective, it is most likely that the agent observed an NBNR. This shows the imbalance associated with the implication, which was first noted in van Krieken et al. (2019) for the Reichenbach implication. It is quite similar to the class imbalance problem in Machine Learning (Japkowicz and Stephen, 2002) in that the real world has far more 'negative' (or contrapositive) examples than positive examples of the background knowledge.
This problem is closely related to the Raven paradox (Hempel, 1945; Vranas, 2004; van Krieken et al., 2019) from the field of confirmation theory, which ponders what evidence can confirm a statement like 'ravens are black'. It is usually stated as follows:
• Premise 1: Observing examples of a statement contributes positive evidence towards that statement.
• Premise 2: Evidence for some statement is also evidence for all logically equivalent statements.
• Conclusion: Observing examples of non-black non-ravens is evidence for 'all ravens are black'.
The conclusion follows from the fact that 'non-black objects are non-ravens' is logically equivalent to 'ravens are black'. Although we are considering logical validity instead of confirmation, we note that for DFL a similar thing happens. When we correct the observation of an NBR to a BR, the difference in truth value is equal to when we correct it to NBNR. More precisely, representing 'ravens are black' as I(a, c), where, for example, I(1, 1) corresponds to BR:
$$I(1, 1) - I(1, 0) = I(0, 0) - I(1, 0),$$
as I(0, 0) = I(1, 1) = 1. Furthermore, when one agent observes a thousand BR's and a single NBR, and another agent observes a thousand NBNR's and a single NBR, their truth value for 'ravens are black' is equal. This seems strange, as the first agent has actually seen many ravens of which only a single exception was not black, while the second only observed many non-ravens which were not black, among which a single raven that was not black either. Intuitively, the first agent's beliefs seem to be more in line with the background knowledge. We will now proceed to analyse a number of implication operators in the light of this discussion.
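The two-agent thought experiment can be made concrete with any implication that satisfies the boundary conditions. The sketch below uses the Reichenbach implication I(a, c) = 1 − a + a · c with an arithmetic-mean aggregator; both choices, and all names, are illustrative assumptions rather than the only possible setup.

```python
def reichenbach(a, c):
    """Reichenbach implication I(a, c) = 1 - a + a * c."""
    return 1.0 - a + a * c


def mean_truth(observations):
    """Truth of 'ravens are black', aggregated with the arithmetic mean."""
    return sum(reichenbach(a, c) for a, c in observations) / len(observations)


# (raven, black) truth values for crisp observations.
BR, NBNR, NBR = (1.0, 1.0), (0.0, 0.0), (1.0, 0.0)

agent_1 = [BR] * 1000 + [NBR]      # a thousand black ravens, one exception
agent_2 = [NBNR] * 1000 + [NBR]    # a thousand non-black non-ravens, one exception

truth_1 = mean_truth(agent_1)
truth_2 = mean_truth(agent_2)
```

Both agents arrive at 1000/1001, even though only the first has observed any black raven at all.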

Analyzing the Implication Operators
We define two functions for a fuzzy implication I: $d_{Ic}(a, c) = \frac{\partial I(a, c)}{\partial c}$ is the derivative with respect to the consequent and $d_{I\neg a}(a, c) = -\frac{\partial I(a, c)}{\partial a}$ is the derivative with respect to the negated antecedent. We choose to take the derivative with respect to the negated antecedent as it makes it slightly easier to compare the two: all fuzzy implications are monotonically decreasing with respect to the antecedent.
A fuzzy implication I is contrapositive differentiable symmetric if $d_{Ic}(a, c) = d_{I\neg a}(1 − c, 1 − a)$ for all a and c. A consequence of contrapositive differentiable symmetry is that if c = 1 − a, then the derivatives are equal, since $d_{Ic}(a, c) = d_{I\neg a}(1 − c, 1 − a) = d_{I\neg a}(a, c)$. This could be seen as the 'distrust' option, in which it increases the consequent and negated antecedent equally.
Proposition 5. If a fuzzy implication I is N C -contrapositive symmetric, where N C is the classical negation, it is also contrapositive differentiable symmetric.
Proof. Say we have an implication I that is $N_C$-contrapositive symmetric, so that I(a, c) = I(1 − c, 1 − a). We find that
$$d_{Ic}(a, c) = \frac{\partial I(a, c)}{\partial c} = \frac{\partial I(1 - c, 1 - a)}{\partial c} = -\frac{\partial I(1 - c, 1 - a)}{\partial (1 - c)} = d_{I\neg a}(1 - c, 1 - a).$$
In particular, by this proposition all S-implications are contrapositive differentiable symmetric. This says that there is no difference in how the implication handles the derivatives with respect to the consequent and antecedent.
Proposition 6. If a fuzzy implication I is left-neutral, then $d_{Ic}(1, c) = 1$. If I is, in addition, contrapositive differentiable symmetric, then $d_{I\neg a}(a, 0) = 1$.
Proof. First, assume I is left-neutral. Then for all c ∈ [0, 1], I(1, c) = c. Taking the derivative with respect to c, it follows that $d_{Ic}(1, c) = 1$. Next, assume I is also contrapositive differentiable symmetric. Then $d_{I\neg a}(a, 0) = d_{Ic}(1, 1 − a) = 1$.
All S-implications and R-implications are left-neutral, but only S-implications are all also contrapositive differentiable symmetric. The derivatives of R-implications vanish when a ≤ c, that is, on no less than half of the domain. Note that the plots in this section are rotated so that the smallest value is in the front to help understand the shape of the functions. In particular, plots of the derivatives of the implications are rotated 180 degrees compared to the implications themselves.

Gödel-based Implications
Implications based on the Gödel t-norm make discrete choices and are single-passing. As $I_{KD}(a, c) = \max(1 − a, c)$, the derivatives are
$$d_{I_{KD}c}(a, c) = 1_{c > 1 - a}, \qquad d_{I_{KD}\neg a}(a, c) = 1_{1 - a > c}.$$
Or, simply put, if we are more confident in the truth of the consequent than in the truth of the negated antecedent, increase the truth of the consequent. Otherwise, decrease the truth of the antecedent. This decision can be somewhat arbitrary and does not take into account the imbalance of modus ponens and modus tollens.
The Gödel implication is a simple R-implication:
$$I_G(a, c) = \begin{cases} 1, & \text{if } a \leq c \\ c, & \text{otherwise.} \end{cases}$$
Its derivatives are:
$$d_{I_G c}(a, c) = 1_{a > c}, \qquad d_{I_G \neg a}(a, c) = 0.$$
These two implications are shown in Figure 7. The Gödel implication increases the consequent whenever a > c, and the antecedent is never changed. This makes it a poorly performing implication in practice. For example, consider a = 0.1 and c = 0. Then the Gödel implication increases the consequent, even if the agent is fairly certain that neither is true. Furthermore, as the derivative with respect to the negated antecedent is always 0, it can never choose the modus tollens correction, which, as we argued, is actually often the best choice.

Lukasiewicz and Yager-based Implications
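The Lukasiewicz implication is both an S- and an R-implication. It is given by $I_{LK}(a, c) = \min(1 − a + c, 1)$ and has the simple derivatives
$$d_{I_{LK}c}(a, c) = d_{I_{LK}\neg a}(a, c) = 1_{a > c}. \tag{42}$$
Whenever the implication is not satisfied because the antecedent is higher than the consequent, simply increase the negated antecedent and the consequent until the antecedent is lower. This could be seen as the 'distrust' choice, as both observations of the agent are equally corrected, and so it does not take into account the imbalance between modus ponens and modus tollens cases. The derivatives of the Gödel implication $I_G$ are equal to those of $I_{LK}$ except that $I_G$ always has a zero derivative for the negated antecedent.
The Yager S-implication is given as
$$I_Y(a, c) = \min\left(\left((1 - a)^p + c^p\right)^{\frac{1}{p}}, 1\right).$$
We plot $I_Y$ for p = 2 in Figure 8. For p = 1, $I_Y$ reduces to $I_{LK}$, for p = 0 to $I_{DP}$, and for p = ∞ to $I_{KD}$. Since it is an S-implication, it is contrapositive symmetric with respect to $N_C$, left-neutral and it satisfies the exchange principle. For p ≤ 1, it satisfies the identity principle. The derivatives are computed as
$$d_{I_Y c}(a, c) = \begin{cases} \left((1 - a)^p + c^p\right)^{\frac{1}{p} - 1} \cdot c^{p-1}, & \text{if } (1 - a)^p + c^p < 1 \\ 0, & \text{otherwise,} \end{cases}$$
$$d_{I_Y \neg a}(a, c) = \begin{cases} \left((1 - a)^p + c^p\right)^{\frac{1}{p} - 1} \cdot (1 - a)^{p-1}, & \text{if } (1 - a)^p + c^p < 1 \\ 0, & \text{otherwise.} \end{cases}$$
We plot these derivatives for p = 2 in Figure 9. For all p, $\lim_{c \to 0} d_{I_Y c}(1, c) = 1$. Furthermore, for p > 1, $\lim_{a \to 1} d_{I_Y c}(a, 0) = 0$ and for p < 1, $\lim_{a \to 1} d_{I_Y c}(a, 0) = \infty$. For p > 1, $I_Y$ can be understood as an increasingly less smooth version of the Kleene-Dienes implication $I_{KD}$. Lastly, this derivative, like those for $T_Y$ and $S_Y$ (Section 7.3), is nonvanishing for only a fraction of $\frac{\Gamma(1 + 1/p)^2}{\Gamma(1 + 2/p)}$ of the input space.
The Yager R-implication is found (see Appendix C.4 for details) as
$$I_{T_Y}(a, c) = \begin{cases} 1, & \text{if } a \leq c \\ 1 - \left((1 - c)^p - (1 - a)^p\right)^{\frac{1}{p}}, & \text{otherwise.} \end{cases}$$
We plot $I_{T_Y}$ for p = 2 in Figure 8. As expected, p = 1 reduces to $I_{LK}$, p = 0 reduces to $I_{WB}$ and p = ∞ reduces to $I_G$. It is contrapositive symmetric only for p = 1. The derivatives of this implication are
$$d_{I_{T_Y} c}(a, c) = \begin{cases} \left((1 - c)^p - (1 - a)^p\right)^{\frac{1}{p} - 1} \cdot (1 - c)^{p-1}, & \text{if } a > c \\ 0, & \text{otherwise,} \end{cases}$$
$$d_{I_{T_Y} \neg a}(a, c) = \begin{cases} \left((1 - c)^p - (1 - a)^p\right)^{\frac{1}{p} - 1} \cdot (1 - a)^{p-1}, & \text{if } a > c \\ 0, & \text{otherwise.} \end{cases}$$
We plot these in Figure 10.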

Product-based Implications
The product S-implication, also known as the Reichenbach implication, is given by $I_{RC}(a, c) = 1 − a + a \cdot c$. We plot it in Figure 11. Its derivatives are given by:
$$d_{I_{RC}c}(a, c) = a, \qquad d_{I_{RC}\neg a}(a, c) = 1 - c.$$
These derivatives closely follow the modus ponens and modus tollens rules. When the antecedent is high, increase the consequent, and when the consequent is low, decrease the antecedent. However, around (1 − a) = c, the derivative is equal and the 'distrust' option is chosen. This can result in counter-intuitive behaviour. For example, if the agent predicts 0.6 for raven and 0.5 for black and we use gradient descent until we find a maximum, we could end up at 0.3 for raven and 1 for black. We would end up increasing our confidence in black as raven was high. However, because of additional modus tollens reasoning, raven is barely true.
Furthermore, if the agent most of the time predicts values around a = 0, c = 0 as a result of the modus tollens case being the most common, then a majority of the gradient decreases the antecedent as $d_{I_{RC}\neg a}(0, 0) = 1$. We identify two methods that counteract this behavior. We introduce the second method in Section 8.5.1.
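The raven example can be traced with plain gradient ascent on the Reichenbach implication, whose partial derivatives are −(1 − c) with respect to the antecedent and a with respect to the consequent. The following is a minimal sketch; the learning rate and stopping rule are our own assumptions.

```python
def ascend_reichenbach(a, c, lr=1e-3, max_steps=100_000):
    """Gradient ascent on I_RC(a, c) = 1 - a + a * c, clipped to [0, 1]."""
    for _ in range(max_steps):
        if c >= 1.0 or a <= 0.0:      # I_RC(a, c) = 1: a maximum is reached
            break
        grad_a = -(1.0 - c)           # modus tollens: lower the antecedent
        grad_c = a                    # modus ponens: raise the consequent
        a = max(0.0, a + lr * grad_a)
        c = min(1.0, c + lr * grad_c)
    return a, c


raven, black = ascend_reichenbach(0.6, 0.5)
```

Starting from raven = 0.6 and black = 0.5, the run ends with black at 1 and raven at roughly 0.33, close to the value quoted in the text: the agent is told the object is certainly black while its belief that it is a raven has nearly halved.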
The first method for counteracting the 'corner' behavior notes that different aggregators change how the derivatives of the implications behave. In particular, we compare the log-product aggregator and the RMSE aggregator and how they combine with the Reichenbach implication. By using the chain rule and Equation 21, we find that the derivatives with respect to the negated antecedent using those aggregators are:
$$\frac{\partial \log \circ A_{T_P}(I_{RC}(a_1, c_1), \ldots, I_{RC}(a_n, c_n))}{\partial \neg a_i} = \frac{1 - c_i}{1 - a_i + a_i \cdot c_i},$$
$$\frac{\partial A_{RMSE}(I_{RC}(a_1, c_1), \ldots, I_{RC}(a_n, c_n))}{\partial \neg a_i} = \frac{(1 - c_i)(a_i - a_i \cdot c_i)}{\sqrt{n} \sqrt{\sum_{j=1}^{n} (a_j - a_j \cdot c_j)^2}}.$$
We plot these functions with respect to $a_i$ and $c_i$ in Figure 12. Note that for the RMSE aggregator, the truth values of other inputs $a_j$, $c_j$, i ≠ j can change the shape of the function. We arbitrarily choose n = 2 and $a_1$, $c_1$ so that $(a_1 − a_1 \cdot c_1)^2 = 0.9$. Note also that the derivative with respect to the negated antecedent using the RMSE aggregator is 0 at $a_i = 0$, $c_i = 0$, as then $a_i − a_i \cdot c_i = 0$, whereas using the log-product aggregator, the derivative is 1. By differentiable contrapositive symmetry, the consequent derivative is 0 when using both aggregators. This shows that when using the RMSE aggregator, the derivatives will vanish at the two corners, i.e. a = 0, c = 0 and a = 1, c = 1. On the other hand, when using the log-product aggregator, one of antecedent and consequent will have a gradient.
The R-implication of the product t-norm is called the Goguen implication and is given by
$$I_{GG}(a, c) = \begin{cases} 1, & \text{if } a \leq c \\ \frac{c}{a}, & \text{otherwise.} \end{cases}$$
We plot this implication in Figure 11. The derivatives of $I_{GG}$ are
$$d_{I_{GG}c}(a, c) = \begin{cases} \frac{1}{a}, & \text{if } a > c \\ 0, & \text{otherwise,} \end{cases} \qquad d_{I_{GG}\neg a}(a, c) = \begin{cases} \frac{c}{a^2}, & \text{if } a > c \\ 0, & \text{otherwise.} \end{cases}$$
We plot these in Figure 13. This derivative is not very useful. First of all, both the modus ponens and modus tollens derivatives increase with ¬a. This is the opposite of the modus ponens rule, as the consequent is increased most when the antecedent is low. For example, if raven is 0.1 and black is 0, then the derivative with respect to black is 10, because of the singularity when a approaches 0.
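The different corner behaviour of the two aggregators can be checked directly. The sketch below evaluates the derivative with respect to the negated antecedent of one instance under the log-product and RMSE aggregators, obtained by applying the chain rule to I_RC(a, c) = 1 − a + a · c; the function names and example instances are our own, and the RMSE version assumes at least one instance has a nonzero error.

```python
import math


def d_logprod_neg_ant(a, c):
    """d log(prod_j I_RC(a_j, c_j)) / d(not a_i) = (1 - c_i) / I_RC(a_i, c_i)."""
    return (1.0 - c) / (1.0 - a + a * c)


def d_rmse_neg_ant(i, pairs):
    """Same derivative under A_RMSE; it depends on all instances via the norm."""
    n = len(pairs)
    errors = [a - a * c for a, c in pairs]          # 1 - I_RC(a_j, c_j)
    norm = math.sqrt(sum(e * e for e in errors))    # assumed to be nonzero
    _, c_i = pairs[i]
    return (1.0 - c_i) * errors[i] / (math.sqrt(n) * norm)


# Instance 0 sits in the modus tollens corner; instance 1 violates the rule.
pairs = [(0.0, 0.0), (0.9, 0.1)]
corner_logprod = d_logprod_neg_ant(*pairs[0])   # 1.0: the corner still gets gradient
corner_rmse = d_rmse_neg_ant(0, pairs)          # 0.0: the corner is ignored
violating_rmse = d_rmse_neg_ant(1, pairs)       # 0.9 / sqrt(2)
```

Under RMSE all of the gradient goes to the instance that actually violates the implication, whereas the log-product aggregator keeps pushing the already satisfied corner instance.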

Sigmoidal Implications
We introduce a new class of fuzzy implications formed by transforming other fuzzy implications using the sigmoid function and translating it so that the boundary conditions still hold. The derivation, along with several proofs of properties, can be found in Appendix C.2.
Definition 20. If I is a fuzzy implication, then the I-sigmoidal implication $\sigma_I$ is given for some s > 0 and $b_0 \in \mathbb{R}$ as
$$\sigma_I(a, c) = \frac{1 + e^{-s \cdot (1 + b_0)}}{e^{-s \cdot b_0} - e^{-s \cdot (1 + b_0)}} \cdot \left(\left(1 + e^{-s \cdot b_0}\right) \cdot \sigma\left(s \cdot (I(a, c) + b_0)\right) - 1\right).$$
Here $b_0$ is a parameter that controls the position of the sigmoidal curve and s controls the 'spread' of the curve. $\sigma_I$ is the function $\sigma(s \cdot (I(a, c) + b_0))$ linearly transformed so that its codomain is the closed interval [0, 1]. Next, we give the derivative of $\sigma_I$. Substituting $d = \frac{1 + e^{-s \cdot (1 + b_0)}}{e^{-s \cdot b_0} - e^{-s \cdot (1 + b_0)}}$ and $h = 1 + e^{-s \cdot b_0}$, we find
$$\frac{\partial \sigma_I(a, c)}{\partial I(a, c)} = d \cdot h \cdot s \cdot \sigma(s \cdot (I(a, c) + b_0)) \cdot (1 - \sigma(s \cdot (I(a, c) + b_0))).$$
The derivative keeps the properties of the original function but smoothes the gradient for higher values of s. As the derivative of the sigmoid function (that is, σ(x) · (1 − σ(x))) cannot be zero, this derivative vanishes only when $\frac{\partial I(a, c)}{\partial \neg a} = 0$ or $\frac{\partial I(a, c)}{\partial c} = 0$. We plot the derivatives for the Reichenbach-sigmoidal implication $\sigma_{I_{RC}}$ in Figure 15. As expected by Proposition 14, it is differentiable contrapositive symmetric. Compared to the derivatives of the Reichenbach implication, it has a small gradient in all corners. When using the log-product aggregator, the derivative of the total valuation with respect to the antecedent is divided by the truth of the implication. In Figure 14 we compare the consequent derivative of the normal Reichenbach implication with the Reichenbach-sigmoidal implication when using the log function. Clearly, for both there is a singularity at a = 1, c = 0, as then the implication is 0 and so the derivative of the log function becomes infinite. A significant difference is that the sigmoidal variant is less 'flat' than the normal Reichenbach implication. This can be useful, as this means there is a larger gradient for values of c that make the implication less true. In particular, the gradients at the modus ponens case (a = 1, c = 1) and the modus tollens case (a = 0, c = 0) are far smaller, which could help balance the effective total gradient by solving the 'corner' problem of the Reichenbach implication we brought up in Section 8.5. These derivatives are smaller for higher values of s.
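A sigmoidal implication can be sketched in a few lines using the constants d and h given above. The closed form used here, d · (h · σ(s · (I + b0)) − 1), is the linear rescaling that is consistent with the stated derivative and with the boundary values 0 and 1; the helper names and the parameter values s = 9 and b0 = −0.2 are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reichenbach(a, c):
    return 1.0 - a + a * c


def make_sigmoidal(I, s=9.0, b0=-0.2):
    """Return (sigma_I, d sigma_I / d I) for a base implication I."""
    d = (1.0 + math.exp(-s * (1.0 + b0))) / (math.exp(-s * b0) - math.exp(-s * (1.0 + b0)))
    h = 1.0 + math.exp(-s * b0)

    def sigma_I(a, c):
        # Rescaled so that I = 0 maps to 0 and I = 1 maps to 1.
        return d * (h * sigmoid(s * (I(a, c) + b0)) - 1.0)

    def d_sigma_dI(a, c):
        z = sigmoid(s * (I(a, c) + b0))
        return d * h * s * z * (1.0 - z)

    return sigma_I, d_sigma_dI


sigma_rc, d_sigma_rc = make_sigmoidal(reichenbach)
false_case = sigma_rc(1.0, 0.0)   # base implication is 0
true_case = sigma_rc(1.0, 1.0)    # base implication is 1
```

To obtain the derivative with respect to the consequent or the negated antecedent, `d_sigma_dI` is multiplied by the corresponding derivative of the base implication, following the chain rule.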
In Figure 16 we plot the Reichenbach-sigmoidal implication for different values of the hyperparameters b 0 and s. Comparing 16a and 16b we see that larger values of b 0 move the sigmoidal shape so that its center is at lower input values. Note that for s = 0.01 in Figure 16c, the plotted function is indiscernible from the plot of the Reichenbach implication in Figure 11 as the interval on which the sigmoid acts is extremely small and the sigmoidal transformation is almost linear. For very high values of s like in 16d we see that the 'S' shape is much thinner, and a larger part of the domain has a low derivative.
In Figure 17 we show what part of the sigmoid function is utilized for different values of $b_0$ and s: the restriction between a pair of vertical lines corresponds to some value of $b_0$ and s. Furthermore, because left-neutrality and contrapositive symmetry give $I_{RC}(1 − x, 0) = I_{RC}(1, x) = x$, both $\sigma_{I_{RC}}(1 − x, 0)$ and $\sigma_{I_{RC}}(1, x)$ are linear transformations of this restriction. For example, the restriction of the sigmoid between the orange bars representing $b_0 = −0.2$ and s = 9 can be seen on the line $\sigma_{I_{RC}}(a, 0)$ plotted in Figure 16b.

Summary
We analyzed several fuzzy implications from a theoretical perspective, while keeping the challenges caused by the material implication in mind. As a result of this analysis, we find that popular R-implications, in particular the Gödel implication, the Yager R-implication and the Goguen implication, will not work well in a differentiable setting. The other analyzed implications seem to have more intuitive derivatives, but may have other practical issues like non-smoothness.

Experiments
To get an idea of the practical behavior of these implications, and other operators, we now perform a series of simple experiments to analyze them in practice. In this section, we discuss experiments using the MNIST dataset of handwritten digits (LeCun and Cortes, 2010) to investigate the behavior of different fuzzy operators introduced in this paper.

Measures
To investigate the performance of the different configurations of DFL, we first introduce several useful metrics. These give us insight into how different operators behave.
Definition 21. The consequent magnitude $|cons|_\varphi$ and the antecedent magnitude $|ant|_\varphi$ for a formula ϕ = ∀x_1, ..., x_m φ → ψ are defined as the sums of the partial derivatives of the DFL loss with respect to the consequent and the antecedent, respectively, taken over $M_\varphi$, the set of instances of the universally quantified formula ϕ.
Definition 23. Given a labeling function l that returns the truth value of a formula given an instance µ according to the data, the consequent and antecedent correctly updated magnitudes are the sums of those partial derivatives for which the consequent or the negated antecedent is true. That is, if the consequent is true in the data, we measure the magnitude of the derivative with respect to the consequent. To evaluate these quantities, we define ratios similar to a precision metric:
Definition 24. The correctly updated ratio for consequent and antecedent is defined as the correctly updated magnitude divided by the corresponding total magnitude.
These ratios quantify what fraction of the updates are going in the right direction. When these ratios approach 1, DFL will increase the truth value of the consequent or negated antecedent correctly. When they are lower, we are increasing truth values of subformulas that are wrong; thus, ideally, we want these measures to be high.

Formulas
We use a knowledge base K of universally quantified logic formulas. There is a predicate for each digit, that is zero, one, ..., eight and nine. For example, zero(x) is true whenever x is a handwritten digit labeled with 0. Secondly, there is the binary predicate same that is true whenever both its arguments are the same digit. We next describe the formulas we use. We also note what the values of cu cons % and cu ant % roughly are if we were to pick at random.
1. ∀x, y zero(x) ∧ zero(y) → same(x, y), ..., ∀x, y nine(x) ∧ nine(y) → same(x, y). If both x and y are handwritten zeros, for example, then they represent the same digit. For this formula, $cu_{cons}\%$ ≥ 1/10 as it is the distribution of same(x, y), and $cu_{ant}\%$ ≤ 99/100 as it is 1 minus the probability that both x and y are zero. The modus ponens case is true in more than 1/100 of the cases, the modus tollens case in less than 9/10 of the cases and the 'distrust' option in more than 9/100 of the cases.
2. ∀x, y zero(x) ∧ same(x, y) → zero(y), ..., ∀x, y nine(x) ∧ same(x, y) → nine(y). If x and y represent the same digit and one of them represents zero, then the other one does as well. For this formula, $cu_{cons}\%$ = 1/10 as it is the probability that a digit represents zero, and $cu_{ant}\%$ ≤ 99/100. The modus ponens case is true in more than 1/100 of the cases, the modus tollens case in 9/10 of the cases and the 'distrust' option in 9/100 of the cases.
3. ∀x, y same(x, y) → same(y, x). This formula encodes the symmetry of the same predicate. As this is a bi-implication, $cu_{cons}\%$ ≥ 1/10 and $cu_{ant}\%$ ≤ 9/10. The 'distrust' option is not possible in this formula.
From this, we can see that a set of operators is better than random guessing for the consequent updates if $cu_{cons}\%$ > 0.1. It is more difficult to say what the value of $cu_{ant}\%$ should be to be as good as random guessing, as the probabilities are upper bounded with the lowest bound at 0.9. We can only say that we know a set of operators to be better than random if $cu_{ant}\%$ > 0.99.

Experimental Setup
We split the MNIST dataset so that 1% of it is labeled and 99% is unlabeled. We use two models. Given a handwritten digit x labeled with digit y, the first model $p_\theta(y|x)$ computes the distribution over the 10 possible labels. We use 2 convolutional layers with max pooling, the first with 10 and the second with 20 filters. Then follow two fully connected hidden layers with 320 and 50 nodes and a softmax output layer. The probability that same($x_1$, $x_2$) holds for two handwritten digits $x_1$ and $x_2$ is modeled by $p_\theta(same|x_1, x_2)$. This takes the 50-dimensional embeddings $e_{x_1}$ and $e_{x_2}$ of $x_1$ and $x_2$ from the fully connected hidden layer. These are used in a network architecture called a Neural Tensor Network (Socher et al., 2013), in which $W^{[1:k]} \in \mathbb{R}^{d \times d \times k}$ is used for the bilinear tensor product, $V \in \mathbb{R}^{k \times 2d}$ is used for the concatenated embeddings and $b \in \mathbb{R}^k$ is used as a bias vector. We use k = 50 for the size of the hidden layer. $u \in \mathbb{R}^k$ is used to compute the output logit, which goes through the sigmoid function σ to get the probability. The loss function we use consists of three terms. The first term is the supervised cross entropy loss with a batch size of 64. The second is the DFL loss, which is weighted by the DFL weight $w_{DFL}$. The third is the supervised binary cross entropy loss used to learn to recognize same(x, y).
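The role of the parameters W, V, b and u in the same predicate can be illustrated with a small forward pass. The sketch below follows the general Neural Tensor Network form of Socher et al. (2013) in plain Python; the tanh nonlinearity, the slice-first layout W[k][i][j], the function name and the tiny dimensions are all assumptions made for illustration and are not taken from the experimental code.

```python
import math
import random


def ntn_same_probability(e1, e2, W, V, b, u):
    """sigmoid(u . tanh(e1^T W[k] e2 + V[k] . [e1; e2] + b[k])) over slices k."""
    concat = e1 + e2                                   # concatenated embeddings
    logit = 0.0
    for k in range(len(u)):
        bilinear = sum(e1[i] * W[k][i][j] * e2[j]
                       for i in range(len(e1)) for j in range(len(e2)))
        linear = sum(V[k][m] * concat[m] for m in range(len(concat)))
        logit += u[k] * math.tanh(bilinear + linear + b[k])
    return 1.0 / (1.0 + math.exp(-logit))


# Toy sizes (the text uses d = 50 and k = 50); random weights for demonstration.
d, k = 4, 3
rng = random.Random(0)
W = [[[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(d)] for _ in range(k)]
V = [[rng.gauss(0.0, 0.1) for _ in range(2 * d)] for _ in range(k)]
b = [0.0] * k
u = [rng.gauss(0.0, 0.1) for _ in range(k)]
e1 = [rng.random() for _ in range(d)]
e2 = [rng.random() for _ in range(d)]

p_same = ntn_same_probability(e1, e2, W, V, b, u)
```

In the setup described above, `e1` and `e2` would be the 50-dimensional hidden-layer embeddings of the two digits, and `p_same` is the truth value that the DFL loss assigns to same(x_1, x_2).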

Results
We analyze the results for many different combinations of hyperparameters, in particular by choosing different operators for aggregation, conjunction and implication. It should be noted that the purely supervised baseline has a test accuracy of 95.0% ± 0.001 (3 runs). Semi-supervised methods that do not improve upon this baseline are useless.
We report the accuracy of recognizing digits in the test set. We do learning for at most 100.000 iterations (or until convergence). We also report the consequent ratio cons% and the consequent and antecedent correctly updated ratios cu cons % and cu ant %. We can compute these values during the backpropagation of the DFL loss on the 'unlabeled' dataset. Because it is a split of a labeled dataset, we can access the labels for evaluation.  Table 6: Results on the MNIST problem for several symmetric configurations. For all, w DF L = 1 except for T P , for which w DF L = 10.
First, we consider several symmetric configurations. A symmetric configuration is one for which the conjunction is a t-norm $T$, disjunction the dual t-conorm of $T$, aggregation the extended t-norm $A_T$ (Equation 2) and the implication the S-implication based on the t-conorm. For example, for $T_P$ we use $T_P$ for conjunction, $S_P$ for disjunction, $A_{\log T_P} = \log \circ A_{T_P}$ for aggregation and $I_{RC}$ for implication (i.e. DPFL from Section 7.4). Symmetric configurations have the benefit of retaining many equivalence relations in fuzzy logic compared to an arbitrary configuration of operators. All are run with $w_{DFL} = 1$ except for $T_P$, which is run using $w_{DFL} = 10$. The results can be found in Table 6.
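For the product t-norm, the symmetric configuration described above can be written down in a few lines. This is a minimal sketch using the standard definitions of the product t-norm, its dual t-conorm (the probabilistic sum), the Reichenbach implication and the log-product aggregator; the function names are chosen here for illustration.

```python
import math


def t_product(a, b):
    """Product t-norm T_P (conjunction)."""
    return a * b


def s_product(a, b):
    """Dual t-conorm S_P (disjunction), also called the probabilistic sum."""
    return a + b - a * b


def i_reichenbach(a, c):
    """Reichenbach implication I_RC, the S-implication S_P(1 - a, c)."""
    return 1.0 - a + a * c


def a_log_product(values):
    """Log-product aggregator: the logarithm of the product of the truth values."""
    return sum(math.log(v) for v in values)
```

The symmetry shows up as two identities: $S_P(a, b) = 1 - T_P(1 - a, 1 - b)$ and $I_{RC}(a, c) = S_P(1 - a, c)$.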
The Gödel t-norm performs worse than the supervised baseline. This is probably because the min aggregator is a poor choice: it only increases the truth value of a single instance, which might be just an exception, as argued in Section 6.1 and as is evident from the low values of $cu_{cons}\%$ and $cu_{ant}\%$.
The Lukasiewicz t-norm with $A_{LK}$ performs much worse than the supervised baseline. What is likely happening is that because $A_{LK}$ either has a derivative of 0 or 1 everywhere, the total gradient is very large when the condition is met. If instead we use the mean average error, the results stabilize and end slightly higher than the supervised baseline. By the definition of $I_{LK}$, cons% $= \frac{1}{2}$ as the consequent and negated antecedent derivatives are equal (see Equation 42). $cu_{cons}\%$ is very low at only 0.05, which as we argued is worse than random guessing. As half of the gradient is MP reasoning, that half is nearly always incorrect.
The performance of the Yager t-norm seems highly dependent on the choice of the parameter p. For p = 20 the top performance is quite a bit higher than the baseline, although in the end it drops. However, for p = 2, the results are even worse than the Lukasiewicz t-norm, which corresponds to p = 1.
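The dependence on $p$ is easier to see with the operator written out. The sketch below uses the standard definition of the Yager t-norm, $T_Y(a, b) = \max(1 - ((1 - a)^p + (1 - b)^p)^{1/p}, 0)$; the function names are illustrative.

```python
def t_yager(a, b, p):
    """Yager t-norm with parameter p >= 1."""
    return max(1.0 - ((1.0 - a) ** p + (1.0 - b) ** p) ** (1.0 / p), 0.0)


def t_lukasiewicz(a, b):
    """Lukasiewicz t-norm, which the Yager t-norm reduces to for p = 1."""
    return max(a + b - 1.0, 0.0)
```

For large $p$ the Yager t-norm approaches the minimum (Gödel) t-norm, so $p$ interpolates between the Lukasiewicz and Gödel behavior.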
The product t-norm performs best and also has the highest values for $cu_{cons}\%$ and $cu_{ant}\%$. To a large extent this is because the log-product aggregator is very effective for this problem, as we will see in the next section.

Varying the Aggregators
In this section, we analyze symmetric configurations, except that we use aggregators other than the one formed by extending the t-norm. In particular, we will consider the RMSE aggregator ($A_{pME}$ with $p = 2$) and the log-product aggregator $A_{\log T_P}$.

Table 7: Configurations using the RMSE aggregator with $w_{DFL} = 1$ and the log-product aggregator with $w_{DFL} = 10$.

Table 7 shows the results when using the RMSE aggregator and a DFL weight of 1 and the log-product aggregator and a DFL weight of 10. Nearly all configurations perform significantly better using these aggregators than when using their 'symmetric' aggregator. In particular, the Gödel, Lukasiewicz and Yager t-norms all outperform the baseline with both aggregators as they are differentiable everywhere and can handle outliers.
The product t-norm seems to do slightly worse with the RMSE aggregator than with the log-product aggregator. As we discussed in Section 8.5, cons% is higher using this aggregator because the corners $a_i = 0, c_i = 0$ and $a_i = 1, c_i = 1$ will have no gradient when using the RMSE aggregator. However, the values of $cu_{cons}\%$ and $cu_{ant}\%$ are much lower than when using the log-product aggregator. This could have to do with the previously made point: as it no longer has a gradient of 1 at the corners $a = 0, c = 0$ and $a = 1, c = 1$, the large gradients occur only when the agent is not yet confident about some prediction. This case is inherently 'riskier', but also contributes more information. It is not as informative to increase the confidence of $a = 0$ if $a$ is already very low. The Lukasiewicz t-norm has a particularly high accuracy of 96.9% with the log-product aggregator and is on the level of performance of the product t-norm. However, it has a very low value for $cu_{cons}\%$ of 0.06 and a relatively low value for $cu_{ant}\%$. Interestingly, it is also the only configuration for which $cu_{cons}\%$ is higher when using the RMSE aggregator than the log-product aggregator.
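To make the comparison between aggregators concrete, the sketch below evaluates four of them on a list of truth values that contains one outlier. It assumes the usual definitions: the minimum for the Gödel aggregator, $\max(\sum_i x_i - (n - 1), 0)$ for the Lukasiewicz aggregator, $1 - \sqrt{\frac{1}{n}\sum_i (1 - x_i)^2}$ for the RMSE aggregator, and $\sum_i \log x_i$ for the log-product aggregator. The function names are illustrative.

```python
import math


def a_min(xs):
    """Goedel (minimum) aggregator: only the least satisfied instance matters."""
    return min(xs)


def a_lukasiewicz(xs):
    """Lukasiewicz aggregator: zero unless almost all instances are satisfied."""
    return max(sum(xs) - (len(xs) - 1), 0.0)


def a_rmse(xs):
    """RMSE aggregator, the generalized mean error with p = 2."""
    return 1.0 - math.sqrt(sum((1.0 - x) ** 2 for x in xs) / len(xs))


def a_log_product(xs):
    """Log-product aggregator."""
    return sum(math.log(x) for x in xs)


truth_values = [0.9, 0.9, 0.1]  # two well satisfied instances and one outlier
```

On `truth_values` the Lukasiewicz aggregator is already at 0, so it provides no gradient, and the minimum aggregator depends on the outlier alone, whereas the RMSE and log-product aggregators still respond to every instance.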

Reichenbach-Sigmoidal Implication
The Reichenbach-sigmoidal implication $\sigma_{I_{RC}}$ is a promising candidate for the choice of implication as we have argued in Section 8.5.1. We fix the aggregator to the log-product, the conjunction operator to the Yager t-norm with $p = 2$, and use a DFL weight of $w_{DFL} = 10$.
On the left plot of Figure 18 we find the results when we experiment with the parameter $s$, keeping $b_0$ fixed to $-\frac{1}{2}$. Note that when $s$ approaches 0 the Reichenbach-sigmoidal implication is $I_{RC}$. The value of 9 gives the best results, with 97.3% accuracy. Interestingly enough, there seem to be clear trends in the values of cons%, $cu_{cons}\%$ and $cu_{ant}\%$. Increasing $s$ seems to increase cons%. This is because the antecedent derivative around the corner $a = 0, c = 0$ will be low, as argued in Section 8.5.1. When $s$ increases, the corners will be more smoothed out. Furthermore, both $cu_{cons}\%$ and $cu_{ant}\%$ decrease when $s$ increases. This could again be because around the corners the derivatives become small, and these are often the 'safest', as the model is already confident about those. For a higher value of $s$, most of the gradient magnitude is at less 'safe' instances. We note that the same happened when using the RMSE aggregator and the product t-norm. Regardless, the best parameter value clearly is not the one for which the values of $cu_{cons}\%$ and $cu_{ant}\%$ are highest, namely the Reichenbach implication itself.
On the right plot of Figure 18 we experiment with the value of $b_0$. Clearly, $-\frac{1}{2}$ works best, having the highest accuracy and $cu_{cons}\%$.

Table 8: The results using $\sigma_{I_{RC}}$ for the implication, $A_{\log T_P}$ for the aggregator with $w_1 = 10$ and several different conjunction operators.
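The role of $s$ and $b_0$ can be explored numerically. The sketch below assumes the sigmoidal implication is obtained by applying a sigmoid to $s \cdot (I(a, c) + b_0)$ and rescaling the result affinely so that $I = 0$ maps to 0 and $I = 1$ maps to 1, which has the same $w \cdot \sigma(\cdot) - h$ shape used in the proofs. The function names and the way the normalization constants are computed are choices made here for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def i_reichenbach(a, c):
    return 1.0 - a + a * c


def sigmoidal(implication, a, c, s=9.0, b0=-0.5):
    """Sigmoidal version of a fuzzy implication, rescaled to the range [0, 1]."""
    low = sigmoid(s * b0)           # sigmoid value reached when I(a, c) = 0
    high = sigmoid(s * (1 + b0))    # sigmoid value reached when I(a, c) = 1
    return (sigmoid(s * (implication(a, c) + b0)) - low) / (high - low)
```

With a very small `s` the sigmoid is nearly linear over the relevant range, so the function is numerically close to the underlying implication, in line with the remark that the Reichenbach-sigmoidal implication tends to $I_{RC}$ as $s$ approaches 0.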

Conjunction Operators and Aggregators
Next, we compare the behavior of different t-norms in Table 8. As conjunctions are only used in the antecedents of implications, they get negative derivatives and thus have to choose which of the conjuncts to decrease. Therefore, they act like their dual t-conorm (see Section 7). The differences in accuracy are small and not significant except for the Nilpotent minimum. Table 9 shows the results when varying the aggregator and DFL weight $w_{DFL}$. $A_{2M}$ refers to the geometric mean by using $p = 2$ in Equation 24. The log-product operator with $w_{DFL} = 10$ does significantly better than the other aggregators. The geometric mean with $w_{DFL} = 10$ has a very high value for $cu_{cons}\%$ but still performs worst of the four. This is because it assigns the highest derivative to the most satisfied assignments, which are likely already correct.

Table 10: The results using $T_Y$, $p = 2$ for the conjunction, $A_{\log T_P}$ for the aggregator with $w_1 = 10$ and several different S-implications and R-implications.

Implications
In Table 10, we compare different fuzzy implications, keeping the conjunction operator fixed to the Yager t-norm with $p = 2$, the aggregator to the log-product and the DFL weight to 10. Again, the Reichenbach implication and the Lukasiewicz implication work well, both having an accuracy around 97%. The Kleene-Dienes implication and the Yager S- and R-implications surpass the baseline as well.
The Gödel implication and Goguen implication have worse performance than the supervised baseline. While the derivatives of $I_{LK}$ and $I_G$ only differ in that $I_G$ disables the derivatives with respect to the negated antecedent, $I_{LK}$ performs among the best but $I_G$ performs among the worst, suggesting that the derivatives with respect to the negated antecedent are required to successfully apply DFL. The Fodor implication is comparable in performance to the Kleene-Dienes implication, which is not surprising as they are equal for all $a > c$.
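The difference between the two implications can be checked with finite differences. The sketch below uses the standard definitions $I_{LK}(a, c) = \min(1 - a + c, 1)$ and $I_G(a, c) = 1$ if $a \leq c$, else $c$. The helper `partials` is a hypothetical name; it estimates the derivative with respect to the consequent and with respect to the negated antecedent.

```python
def i_lukasiewicz(a, c):
    return min(1.0 - a + c, 1.0)


def i_godel(a, c):
    return 1.0 if a <= c else c


def partials(implication, a, c, eps=1e-6):
    """Central finite differences w.r.t. the consequent and the negated antecedent."""
    d_cons = (implication(a, c + eps) - implication(a, c - eps)) / (2 * eps)
    # Increasing the negated antecedent (1 - a) means decreasing a, hence the sign flip.
    d_neg_ant = -(implication(a + eps, c) - implication(a - eps, c)) / (2 * eps)
    return d_cons, d_neg_ant
```

At a point with $a > c$, such as $a = 0.8, c = 0.3$, the Lukasiewicz implication gives two equal derivatives, while the Gödel implication keeps the consequent derivative and has none for the negated antecedent.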

Influence of Individual Formulas
Formulas used    Accuracy    cons%    cu_cons%    cu_ant%
(1) and (2)      97.1        0.05     0.54        0.99
(2) and (3)      95.9        0.12     0.75        0.95
(1) and (3)      96 .

Table 11: The results using $\sigma_{I_{RC}}$ for the implication with $s = 9$ and $b_0 = -\frac{1}{2}$, $T_Y$, $p = 2$ for the conjunction and $A_{\log T_P}$ for the aggregator with $w_{DFL} = 10$, leaving some formulas out. The numbers indicate the formulas that are present during training.
Finally, we compare the influence of the different formulas in Table 11. Removing the reflexivity formula (3) does not largely impact the performance. The biggest drop in performance comes from removing formula (1), which defines the same predicate. Using only formula (1) gets slightly better performance than only using formula (2), despite the fact that no positive labeled examples can be found using formula (1) as the predicates zero to nine are not in its consequent. Since 95% of the derivatives are with respect to the negated antecedent, this formula contributes by finding additional counterexamples. Furthermore, improving the accuracy of the same predicate improves the accuracy on digit recognition: just using the reflexivity formula (3) has the highest accuracy when used individually, even though it does not use the digit predicates.

Analysis
We plot the accuracy of the different configurations with respect to $cu_{cons}$ and $cu_{ant}$ in Figures 19a and 19b. Figure 19b seems to show a positive correlation. Furthermore, the best configurations using $\sigma_{I_{RC}}$ are the ones with the highest value of $cu_{ant}$. Although there seems to be a slight positive correlation in Figure 19a, it is not as pronounced and the configurations with the highest accuracy are not quite the ones with the highest value of $cu_{cons}$. Furthermore, there are decently performing methods that use the Lukasiewicz implication and have low values for $cu_{cons}$. We plot the values of cons% against the values of $cu_{cons}\%$ and $cu_{ant}\%$ in Figures 19c and 19d. For both, there seems to be a negative correlation. Apparently, if the ratio of derivatives with respect to the consequent becomes larger, then this decreases the correctness of the updates. In Section 9.4.3 we argued, when experimenting with the value of $s$, that this could be because for lower values of cons%, a smaller portion of the reasoning happens in the 'safe' corners around $a = 0, c = 0$ and $a = 1, c = 1$, and more for cases that the agent is less certain about. As all S-implications have strong derivatives at both these corners (Proposition 6), this phenomenon is likely present in other S-implications.

Conclusions
We have run experiments on many different configurations of hyperparameters to explore what works and what does not. The only well performing fully symmetric option is the product t-norm. If we are willing to forego symmetry, we find that using the Reichenbach-sigmoidal implication with the log-product aggregator and a DFL weight of 10 has somewhat better performance. The choice of conjunction does not seem to matter as much. The Lukasiewicz implication additionally does well with the log-product aggregator, suggesting that the choice of aggregator and DFL weight is quite vital.
Although Differentiable Fuzzy Logics significantly improves on the supervised baseline and is thus suited for semi-supervised learning, it is not competitive with state-of-the-art methods like Ladder Networks (Rasmus et al., 2015), which achieve an accuracy of 98.9% with 100 labeled pictures and 99.2% with 1000.

Related Work
Differentiable Fuzzy Logics falls into the discipline of Statistical Relational Learning (Getoor and Taskar, 2007), which concerns models that can reason under uncertainty and learn relational structures like graphs.

Differentiable Fuzzy Logics
Special cases of DFL have been researched in several papers under different names. Real Logic (Serafini and Garcez, 2016) implements function symbols and uses a neural model called Logic Tensor Networks to interpret predicates. It uses t-norms with their dual t-conorm and their respective S-implications. Real Logic is applied to weakly supervised learning on Scene Graph Parsing (Donadello et al., 2017) and transfer learning in Reinforcement Learning (Badreddine and Spranger, 2019).
Semantic-based regularization (SBR) (Diligenti et al., 2017a) applies DFL to kernel machines. They use R-implications and the mean aggregator. Sen et al. (2008) applies SBR to collective classification by predicting using a trained deep learning model, and then using gradient descent on the DFL loss to find new truth values. Using this method, predictions are consistent with the formulas during test-time. Marra et al. (2019b) uses t-norm Fuzzy Logics, where the R-implication is used alongside weak disjunction. By using t-norms based on generator functions, the satisfiability computation can be simplified and generalizations of common loss functions can be found. Marra et al. (2018) applies DFL to image generation. It uses the product t-norm, the log-product aggregator and the Goguen implication. By using function symbols that represent generator neural networks, they create constraints that are used to create a semantic description of an image generation problem. Rocktäschel et al. (2015) uses the product t-norm and Reichenbach implication for relation extraction by using an efficient matrix embedding of the rules. Guo et al. (2016) extends this to link prediction and triple classification by using a margin-based ranking loss for implications. Demeester et al. (2016) uses a regularization technique equivalent to the Lukasiewicz implication. Instead of using existing data, it finds a loss function which does not iterate over objects, yet can guarantee that the rules hold. This is very scalable, but can only model simple implications. A promising approach is using adversarial sets (Minervini et al., 2017), which are sets of objects from the domain that do not satisfy the knowledge base. These are probably the most informative objects. It uses gradient descent to find objects that minimize the satisfiability. The parameters of the deep learning model are then updated so that it predicts consistently with the knowledge base on this adversarial set.
A benefit of this approach is that it does not have to iterate over instances that already satisfy the constraints. Adversarial sets are applied to natural language interpretation in Minervini and Riedel (2018). Both papers use the Lukasiewicz implication and the Gödel t-norm and t-conorm. They are not able to infer new labels on existing unlabeled data as they use artificial data, but these methods are not orthogonal and can be used jointly.

Methods using Differentiable Fuzzy Logic Operators
Posterior regularization (Ganchev and Gillenwater, 2010; Hu et al., 2016) is a framework for weakly supervised learning on structured data. It projects the output of a deep learning model to a 'rule-regularized subspace' to make it consistent with the knowledge base. This output is used as a label for the deep learning model to imitate. Unlike this paper, it does not compute derivatives over the computation of the satisfaction of the knowledge base. Marra et al. (2019a) and Daniele and Serafini (2019) both instead use gradient descent for the projection. Therefore, unlike the other methods mentioned here, the derivatives with respect to the operators are relevant. They learn the formula weights jointly with the parameters of the deep learning model.
∂ILP (Evans and Grefenstette, 2018) is a differentiable inductive logic programming method that uses the product t-norm and t-conorm to do differentiable inference. The Neural Theorem Prover does differentiable proving of queries and combines different proof paths using the Gödel t-norm and t-conorm. Šourek et al. (2015) also introduces a method for differentiable query proving, with learnable weights for rules. They use operators inspired by fuzzy logic and transformed by the sigmoid function.
There is a lot of literature on Fuzzy Neural Networks (Jang, 1993;Jang et al., 1997;Lin and Lee, 1991) that replace standard neural network neurons with neurons based on fuzzy logic. Some of the neurons use fuzzy logic operators which are differentiated through if the networks are trained using backpropagation.

Differentiable Probabilistic Logics
Some approaches use probabilistic logics instead of fuzzy logics and interpret predicates probabilistically. As deep learning classifiers model probability distributions, probabilistic logics could be a more natural choice than fuzzy logics. DeepProbLog (Manhaeve et al., 2018) is a probabilistic logic programming language with neural predicates that compute the probabilities of ground atoms. It supports automatic differentiation which can be used to back-propagate from the loss at a query predicate to the deep learning models that implement the neural predicates, similar to DFL. It supports probabilistic rules which can handle exceptions to rules. We compare another differentiable probabilistic logic called Semantic Loss (Xu et al., 2018) in Appendix D and show similarities between it and DFL using operators based on the product t-norm. This similarity suggests that many practical problems that DPFL has are also present in Semantic Loss. They apply Semantic Loss to MNIST semi-supervised learning with a different knowledge base than ours. As inference is exponential in the size of the grounding for probabilistic logics, both approaches use an advanced compilation technique (Darwiche, 2011) to make inference feasible for larger problems.

Discussion
This paper presented theoretical results of Differentiable Fuzzy Logics operators and then evaluated their behavior on semi-supervised learning on MNIST.
We now discuss some additional problems with deploying solutions using DFL. DFL can be seen as a form of multi-objective optimization (Hwang and Masud, 2012). In the DFL loss (Equation 13) we sum up the valuations of different formulas, each of which is a separate objective. The loss landscape can change significantly when the weights for individual formulas are changed. As we saw in Section 9.4.4, a lower value of the DFL weight has worse performance. Having so many different objectives requires significant hyperparameter tuning. A method capable of learning weights for the formulas jointly, like those of Marra et al. (2019a), Daniele and Serafini (2019) and Šourek et al. (2015), could solve this problem.
A second major challenge is related to the class imbalance problem (Japkowicz and Stephen, 2002;Buda et al., 2018). We argued in Section 8.1 that for a significant portion of common-sense background knowledge, the modus tollens case (i.e., the non-black non-raven) is by far the most common of the four cases. The simple and small MNIST problem indeed showed that most well-performing implications have a far larger derivative with respect to the negated antecedent than to the consequent. This imbalance will only increase for more complex problems. However, simply removing derivatives with respect to the antecedent does not seem to be the solution. A reason for this could be that those are usually correct, unlike derivatives with respect to the consequent. In fact, we found in Section 9.4.6 that the formula in which the digits are in the antecedent performs better on its own than the formula in which the digits are in the consequent, even though the model could not learn from any new positive examples.
Although we have focused on experimenting with the accuracy of the derivatives of the implication, it should be noted that the derivatives of the disjunction operator make a choice as well. For example, if the agent observes a walking object and the supervisor knows that only humans and animals can walk, how is the supervisor supposed to choose whether it is a human or an animal? Here, similar imbalances exist in the different possible classes: There might be more images of humans than of animals.
Lastly, we pose the question whether it is more important that we choose operators based on the performance on the task at hand, or based on its logical properties. The choice of operators that performed best in the MNIST problem uses the Gödel t-norm, the log-product aggregator and the Reichenbach-sigmoidal implication. None of these three operators are based on the same t-norm, that is, this choice of operators is not 'symmetric'. Differentiable Product Fuzzy Logic is the only viable choice in which all of the operators can be chosen based on one t-norm, and it can only be used for formulas in prenex form. The largest benefit of a 'symmetric' choice of operators is that the truth value of formulas that are logically equivalent in classical logic will be equal. This makes it easier to analyze how the background knowledge will behave and does not require putting it in a particular form.

Conclusion
We analyzed Differentiable Fuzzy Logics in order to understand how reasoning using logical formulas behaves in a differentiable setting. We examined how the properties of a large number of different operators affect DFL. We have found substantial differences between the properties of a large number of such Differentiable Fuzzy Logics operators, and we showed that many of them, including some of the most popular operators, are highly unsuitable for use in a differentiable learning setting. By analyzing aggregation functions, we found that the log-product aggregator and the RMSE aggregator have convenient connections to both fuzzy logic and machine learning and can deal with outliers. Next, we analyzed conjunction and disjunction operators and found several strong candidates. In particular, the Gödel t-norm and t-conorm are a strong yet simple choice, and the Yager t-norm and the product t-conorm have intuitive derivatives.
We noted an interesting imbalance between derivatives with respect to the negated antecedent and the consequent of the implication. Because the modus tollens case is much more common, we conclude that a large part of the useful inferences on the MNIST experiments are made by decreasing the antecedent, or by 'modus tollens reasoning'. Furthermore, we found that derivatives with respect to the consequent often increase the truth value of something that is false as the consequent is false in the majority of times. Therefore, we argue that 'modus tollens reasoning' should be embraced in future research. As a possible solution to problems caused by this imbalance, we introduced a smoothed fuzzy implication called the Reichenbach-sigmoidal implication.
Experimentally, we found that the product t-norm is the only t-norm that can be used as a base for all choices of operators. The product t-conorm and the Reichenbach implication have derivatives that are intuitive and that correspond to inference rules from classical logic, and the log-product aggregator is among the most effective. The logic based on the product t-norm that we call Differentiable Product Fuzzy Logic (DPFL) has connections to probabilistic methods.
While the Lukasiewicz t-norm has about the same performance as DPFL on the MNIST problem when using the log-product aggregator, the Lukasiewicz aggregator vanishes on most of its domain and its relaxed version, the mean average error aggregator, is not able to distinguish outliers. However, the Lukasiewicz implication is the best R-implication in our experiments. Lastly, the Reichenbach-sigmoidal implication performs best on the MNIST experiments. The hyperparameters of sigmoidal implications can be tweaked to decrease the imbalance of the derivatives with respect to the negated antecedent and consequent. In order to gain the largest improvements over a supervised baseline, we had to abandon the normal symmetric configurations of norms, where both t-norms, s-norms and the aggregation operators satisfy the usual algebraic relations. Instead, we had to resort to non-symmetric configurations where different norms are combined.
We believe a proper empirical comparison of different methods that introduce background knowledge through logic could be useful to properly understand the details, performance, possible applications and challenges of each method. Secondly, we believe more work is required in using background knowledge to help deep models train on real-world problems. One research direction would be to develop methods that can properly deal with exceptions. An approach in which the weights for the different formulas can be learned could be used to distinguish between relevant and irrelevant formulas in the background knowledge, and probabilistic instead of fuzzy logics could be a more natural fit. Lastly, additional research on the vast space of fuzzy logic operators might find more properties that are useful in DFL.

 2: if ϕ = P(x_1, ..., x_m) then
 3:     return g[P, (µ(x_1), ..., µ(x_m))]    ▷ Find the truth value of a ground atom using the dictionary g.
 4: else if ϕ = ¬φ then
    ...
12: else if ϕ = ∀x φ then    ▷ Apply the aggregation operator as a quantifier.
13:     return A_{o∈C} e_{N,T,S,I,A}(φ, g, C, µ ∪ {x/o})    ▷ Each assignment can be seen as an instance of ϕ.

The computation of the satisfaction is shown in pseudocode form in Algorithm 1. By first computing the dictionary g that contains truth values for all ground atoms, we can reduce the number of forward passes through the computations of the truth values of the ground atoms that are required to compute the satisfaction.
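The recursive structure of Algorithm 1 can be sketched in Python as follows. This is not the authors' implementation: the tuple-based formula encoding, the `ops` dictionary and the function name `evaluate` are assumptions made for illustration, and the connective cases that are elided in the pseudocode above are filled in the obvious way by applying the chosen operator to the recursively computed truth values.

```python
import math


def evaluate(phi, g, constants, mu, ops):
    """Truth value of formula phi under variable assignment mu.

    phi is a nested tuple such as ("forall", "x", ("implies", A, B));
    g maps (predicate, objects) to the truth value of a ground atom;
    ops holds the operators N, T, S, I (functions) and A (aggregator over a list).
    """
    kind = phi[0]
    if kind == "atom":
        _, predicate, variables = phi
        return g[(predicate, tuple(mu[x] for x in variables))]
    if kind == "not":
        return ops["N"](evaluate(phi[1], g, constants, mu, ops))
    if kind in ("and", "or", "implies"):
        operator = ops[{"and": "T", "or": "S", "implies": "I"}[kind]]
        return operator(evaluate(phi[1], g, constants, mu, ops),
                        evaluate(phi[2], g, constants, mu, ops))
    if kind == "forall":
        _, variable, body = phi
        # Each assignment of an object to the variable is one instance of phi.
        return ops["A"]([evaluate(body, g, constants, {**mu, variable: o}, ops)
                         for o in constants])
    raise ValueError(f"unknown formula type: {kind}")


# Operators based on the product t-norm, with the plain product as aggregator.
product_ops = {
    "N": lambda a: 1.0 - a,
    "T": lambda a, b: a * b,
    "S": lambda a, b: a + b - a * b,
    "I": lambda a, c: 1.0 - a + a * c,
    "A": math.prod,
}
```

For example, with two objects and the formula ∀x raven(x) → black(x), `evaluate` looks up four ground atoms in `g`, applies the implication once per object and aggregates the two results.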
This algorithm can fairly easily be parallelized for efficient computation on a GPU by noting that the individual terms that are aggregated over in line 12 (the different instances of the universal quantifier) are not dependent on each other. By noting that formulas are in prenex normal form, we can set up the dictionary g using tensor operations so that the recursion has to be done only once for each formula. This can be done by applying the fuzzy operators elementwise over vectors of truth values instead of a single truth value, where each element of the vector represents a variable assignment.
The complexity of this computation then is $O(|K| \cdot P \cdot b^d)$, where $K$ is the set of formulas, $P$ is the number of predicates used in each formula and $d$ is the maximum depth of nesting of universal quantifiers in the formulas in $K$ (known as the quantifier rank). This is exponential in the number of quantifiers, as every object from the constants $C$ has to be iterated over in line 12, although as mentioned earlier this can be mitigated somewhat using efficient parallelization. Still, computing the valuation for transitive rules (such as $\forall x, y, z\; Q(x, z) \wedge R(z, y) \rightarrow P(x, y)$) will for example be far more demanding than for antisymmetry formulas (such as $\forall x, y\; P(x, y) \rightarrow \neg P(y, x)$).
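The growth in the number of instances can be made concrete by enumerating all variable assignments of a formula in prenex form. In the sketch below, `instance_truth_values` is a hypothetical helper, the truth table `P` is filled with the constant 0.5 purely for illustration, and $b$ is read as the number of objects each quantifier ranges over, which is an assumption since the text does not define it.

```python
from itertools import product as cartesian


def i_reichenbach(a, c):
    return 1.0 - a + a * c


def instance_truth_values(variables, body, constants):
    """Evaluate `body` once for every assignment of constants to the variables."""
    return [body(dict(zip(variables, assignment)))
            for assignment in cartesian(constants, repeat=len(variables))]


constants = ["o1", "o2", "o3", "o4"]
# Toy truth values for a binary predicate, standing in for the dictionary g.
P = {(x, y): 0.5 for x in constants for y in constants}

# Antisymmetry: forall x, y  P(x, y) -> not P(y, x)
antisymmetry = instance_truth_values(
    ["x", "y"],
    lambda m: i_reichenbach(P[(m["x"], m["y"])], 1.0 - P[(m["y"], m["x"])]),
    constants)

# Transitivity-style rule: forall x, y, z  P(x, z) and P(z, y) -> P(x, y)
transitivity = instance_truth_values(
    ["x", "y", "z"],
    lambda m: i_reichenbach(P[(m["x"], m["z"])] * P[(m["z"], m["y"])],
                            P[(m["x"], m["y"])]),
    constants)
```

With four objects the two-variable formula has 16 instances and the three-variable formula has 64, so each additional quantifier multiplies the work by the number of objects. The lists produced here are the vectors of truth values over which an aggregator would be applied.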

Appendix B. Proofs
Appendix B.1. Single-Passing

Proposition 7. Any composition of single-passing functions is also single-passing.
Proof. We will prove this by induction. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a single-passing function and let $x_1, ..., x_n \in \mathbb{R}$. Then clearly $f(x_1, ..., x_n)$ is single-passing.
Next, let $g : \mathbb{R}^n \to \mathbb{R}$ be a single-passing function and let each of $y_1, ..., y_n$ be a composition of single-passing functions. Let $X_i$ be the set of inputs to $y_i$. For any $x \in X_i$ it holds that

$$\frac{\partial g(y_1(X_1), ..., y_n(X_n))}{\partial x} = \frac{\partial g(y_1(X_1), ..., y_n(X_n))}{\partial y_i(X_i)} \cdot \frac{\partial y_i(X_i)}{\partial x}. \quad \text{(B.1)}$$

As $g$ is single-passing, there is at most 1 $y_i$ so that $\frac{\partial g(y_1(X_1), ..., y_n(X_n))}{\partial y_i(X_i)} \neq 0$. If there is 0, then there can also be no $x \in \bigcup_j X_j$ such that $\frac{\partial g(y_1(X_1), ..., y_n(X_n))}{\partial x} \neq 0$. If there is 1, then by the inductive hypothesis, $y_i(X_i)$ is single-passing. Therefore, there is at most 1 value $x \in X_i$ so that $\frac{\partial y_i(X_i)}{\partial x} \neq 0$, and by Equation B.1 there is at most 1 value $x \in \bigcup_j X_j$ such that $\frac{\partial g(y_1(X_1), ..., y_n(X_n))}{\partial x} \neq 0$. We conclude that $g(y_1(X_1), ..., y_n(X_n))$ is single-passing.

2. $\sigma_I(a_1, c_1) = w \cdot \sigma(s \cdot (I(a_1, c_1) + b_0)) - h = w \cdot \sigma(s \cdot (I(a_2, c_2) + b_0)) - h = \sigma_I(a_2, c_2)$

As there are no loops when each ground atom appears uniquely in $\phi$, the factor graph over which loopy belief propagation is done is a tree. As $e_\theta(\phi; \emptyset)$ corresponds to a single iteration of loopy belief propagation, this is equal to regular belief propagation, which is an exact method for computing queries on probabilistic models (Pearl, 1988). Clearly, this condition on $\phi$ is very strong. Although loopy belief propagation is known to often be a good approximation empirically (Murphy et al., 2013), the degree to which DPFL approximates Semantic Loss requires further research as this is not a guarantee. However, if DPFL approximates Semantic Loss well, it can be a strong alternative as it is not an exponential computation. It also means that most problems of DPFL will also be present in Semantic Loss. For example, if we just have the formula $\forall x\; raven(x) \rightarrow black(x)$, the grounding of the knowledge base will not contain repeated ground atoms, and thus Semantic Loss and DPFL are equivalent and share difficulties related to the imbalance of modus ponens and modus tollens.
A Bayesian network is a joint distribution factorized as $p(x) = \prod_{i=1}^{n} p(x_i | x_{pa(i)})$, where $x_{pa(i)}$ is the set of random variables that are parents of $x_i$. In particular, we are interested in the joint distribution $p(\phi, w | \eta_\theta)$. We use the compositional structure of $\phi$ to expand $p(\phi | w)$.

(D.3)
A specific world $w$ uniquely determines a single $\Phi$ so that $\prod_{\varphi \in \Phi} p(\varphi | ch(\varphi)) = 1$. Note that the distribution $p(\phi | w) = \prod_{\varphi \in \Phi} p(\varphi | ch(\varphi))$ forms a polytree (or directed tree), as a logical expression is formed as a tree. From this Bayesian network, we define the factor graph over which we do the belief propagation. For brevity, we denote a specific ground atom $P(o_1, ..., o_k)$ as $P_O$.
• There is a variable node φ and a factor node f φ for every subformula φ ∈ Φ.
• Let $\phi = \forall x_1, ..., x_n\; \varphi$ be the top node. Denote the set of all instances of $\phi$ by $M$. Then $f_\phi(\phi, m_1, ..., m_{|M|}) = I[\phi = \bigwedge_{m \in M} \alpha_m]$, where $e_\mu$ is the random variable corresponding to the instantiation of $m$ in $\varphi$.
We ignore the other connectives as they can be formed from $\neg$ and $\wedge$, both in classical logic and in DPFL. Next, we compute the messages in belief propagation. We start from the world variable nodes $w_{P_O}$ and move up through the computation tree to $\phi$. The messages for factors to variables are given as (Bishop, 2006)

$$\mu_{f_s \to x}(x) = \sum_{X} f_s(x, X) \prod_{y \in ne(f_s) \setminus x} \mu_{y \to f_s}(y)$$

where $ne(x)$ is the set of neighbours of node $x$. The messages for variables to factors are given as

$$\mu_{x \to f_s}(x) = \prod_{l \in ne(x) \setminus f_s} \mu_{f_l \to x}(x).$$

a. $\mu_{f_{w_{P_O}} \to w_{P_O}}(w_{P_O}) = \eta_\theta(P)(o_1, ..., o_k)^{w_{P_O}} \cdot (1 - \eta_\theta(P)(o_1, ..., o_k))^{1 - w_{P_O}}$, factor to variable for ground atom.
We wish to know what the marginal probability $p(\phi = 1 | \eta_\theta)$ is. A marginal of a variable $\varphi$ in a factor graph is found as $p(\varphi) = \prod_{s \in ne(\varphi)} \mu_{f_s \to \varphi}(\varphi)$. The variable node $\phi$ only has the factor node $f_\phi$ as a neighbor, so using (g.) we find

$$p(\phi = 1 | \eta_\theta) \approx \prod_{m \in M} \mu_{\alpha_m \to f_\phi}(1). \quad \text{(D.4)}$$

Next, we use induction to prove that the computation of $\mu_{f_{\alpha_m} \to \alpha_m}(1)$ is equal to Differentiable Product Fuzzy Logic.
Using Equation D.4 we then find that $p(\phi = 1 | \eta_\theta) \approx \prod_{m \in M} e_\theta(\varphi_m, m)$, which is equal to the Differentiable Product Fuzzy Logic computation of the universal quantifier in Equation 11. Importantly, as the computation of the logic is itself a tree, the only loops are caused through ground atoms appearing in multiple subformulas. Therefore, when each ground atom only appears in a single formula, Differentiable Product Fuzzy Logic computes the same probability as Semantic Loss.
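The claim that the two computations agree when every ground atom appears only once, and may differ otherwise, can be checked by brute force on small formulas. The sketch below enumerates all worlds of independent Bernoulli atoms to obtain the exact probability that a formula holds and compares it with the product-based fuzzy value. The function `exact_probability` and the example probabilities are illustrative assumptions, and the enumeration is exponential in the number of atoms, so it is only meant for toy cases.

```python
from itertools import product as cartesian


def exact_probability(formula, probs):
    """Sum the probabilities of all worlds (truth assignments) that satisfy `formula`.

    probs maps each ground atom name to its independent probability of being true;
    formula is a function from a world (dict of booleans) to a boolean.
    """
    names = sorted(probs)
    total = 0.0
    for values in cartesian([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            weight = 1.0
            for name in names:
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total


# Each atom appears once: raven -> black.
probs = {"raven": 0.9, "black": 0.8}
exact_implication = exact_probability(lambda w: (not w["raven"]) or w["black"], probs)
fuzzy_implication = 1.0 - probs["raven"] + probs["raven"] * probs["black"]  # Reichenbach

# The same atom appears twice: P and P.
exact_repeated = exact_probability(lambda w: w["P"] and w["P"], {"P": 0.6})
fuzzy_repeated = 0.6 * 0.6  # product t-norm treats the two occurrences as independent
```

In the first case both values coincide, while in the second the product t-norm underestimates the exact probability because the repeated atom introduces a loop in the factor graph.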
Here, box nodes correspond to factor nodes and circle nodes correspond to variable nodes. As γ is the top formula, this is where the messages get passed to. Note that there is a single loop, which is present because the atom P is used twice in the formula. This causes two incorrect messages: µ vw P →f P1 and µ vw P →f P2 . The first is incorrect as it does not have access to the incoming message µ f P2 →vw P and puts it to 1.