Learning Horn Envelopes via Queries from Language Models

We present an approach for systematically probing a trained neural network to extract a symbolic abstraction of it, represented as a Boolean formula. We formulate this task within Angluin’s exact learning framework, where a learner attempts to extract information from an oracle (in our work, the neural network) by posing membership and equivalence queries. We adapt Angluin’s algorithm for Horn formulas to the case where the examples are labelled w.r.t. an arbitrary Boolean formula in CNF (rather than a Horn formula). In this setting, the goal is to learn the smallest representation of all the Horn clauses implied by a Boolean formula, called its Horn envelope, which in our case correspond to the rules obeyed by the network. Our algorithm terminates in exponential time in the worst case and in polynomial time if the target Boolean formula can be closely approximated by its envelope. We also show that extracting Horn envelopes in polynomial time is as hard as learning CNFs in polynomial time. To showcase the applicability of the approach, we perform experiments on BERT-based language models and extract Horn envelopes that expose occupation-based gender biases.


Introduction
Artificial Intelligence (AI) models are now ubiquitous in several domains, often used as black boxes. Despite all the efforts to develop trustworthy AI, the challenges in developing unbiased systems remain. Towards unravelling the hidden knowledge of black box models, in this work we investigate an approach for extracting knowledge from machine learning models based on Angluin's exact learning model. In the exact learning model, a learner interacts with a teacher (called an oracle) via queries in order to identify an abstract target concept.
The most studied kinds of queries in the exact learning model are membership and equivalence queries. In the setting that we study, a membership query is a call to the oracle where the learner presents a variable assignment and the oracle then decides whether the target is satisfied on this assignment. In an equivalence query, the learner presents a hypothesis to the oracle, which decides whether this hypothesis is equivalent to the target and, if not, returns a counterexample on which the two differ. Using a trained neural network as the oracle raises a number of obstacles, which we address as follows. 1. We simulate equivalence queries by generating at random a batch of examples and asking the neural network for the classification. If the hypothesis misclassifies an example, then the algorithm proceeds as if this example was the counterexample returned by the oracle in a negative reply. If the hypothesis classifies all the examples from the batch correctly then, even though not equivalent, one can expect that with high probability there is not much difference between the target and the hypothesis.
2. We convert interpretations into expressions in natural language, and convert the classification given by the model back into the format expected by the algorithm. 3. We propose an adaptation of Angluin's algorithm for Horn formulas to deal with non-Horn oracles. We prove that this algorithm is guaranteed to terminate in exponential time, and in polynomial time in the size of the most concise Horn envelope of the target formula and the number of variables if the number of non-Horn examples is polynomial.
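The first item above, simulating an equivalence query by sampling, can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: the function name, the callable interface, and the batch size are our own assumptions, with assignments represented as frozensets of true variables.

```python
import random

def simulated_equivalence_query(hypothesis, oracle, n_vars, batch_size=1000):
    """Approximate an equivalence query by sampling random assignments.

    `hypothesis` and `oracle` map an assignment (frozenset of true
    variable indices) to True/False; `oracle` stands in for the
    neural network's classification.
    """
    for _ in range(batch_size):
        x = frozenset(v for v in range(n_vars) if random.random() < 0.5)
        if hypothesis(x) != oracle(x):
            return x  # treated as the counterexample of a negative reply
    return None  # no disagreement found: accept the hypothesis w.h.p.

# Toy check: hypothesis "v0 -> v1" vs. an oracle that also requires v2 -> v1.
hyp = lambda x: (0 not in x) or (1 in x)
orc = lambda x: ((0 not in x) or (1 in x)) and ((2 not in x) or (1 in x))
cex = simulated_equivalence_query(hyp, orc, n_vars=3)
# Any counterexample found is a genuine point of disagreement.
assert cex is None or hyp(cex) != orc(cex)
```

If the batch passes without disagreement, the hypothesis is accepted, which is exactly the "probably approximately correct" reading of equivalence described above.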
Case Study To showcase the applicability of the approach, we perform experiments on BERT-based language models [15,26] in order to extract knowledge from these models and study the correlation between genders, occupations, periods of time, and locations. Our findings corroborate previous work exposing harmful biases in these language models (see Subsection 2.2), which in turn supports the validity of our approach and reflects deeply ingrained biases in society [1].
Our work is organized as follows. In Section 2 we highlight related work on probing language models and on exact learning of Horn formulas, envelopes, and CNFs. In Section 3 we provide basic definitions used to address the third obstacle in Section 4. In particular, we present in Section 4 an algorithm for exactly learning Horn envelopes and show that this problem is at least as hard as exactly learning CNFs. In Section 5 we describe in more detail how we address the first and second obstacles. Then, in Section 6 we present our experimental results using language models as oracles. Finally, we conclude in Section 7.

Related Work
In this section we discuss related work. We first discuss work on learning Horn and CNF formulas in Angluin's style. Subsequently, we present work related to probing neural networks to expose various types of biases, with an emphasis on pre-trained language models.

Learning Horn Formulas and Envelopes
The problem of exactly learning Horn formulas from examples was first studied in [5], where the authors give a polynomial (quadratic) time learning algorithm with membership and equivalence queries. Horn formulas are semantically characterised by their preservation under intersection (∩) of models, a property that is heavily exploited by the algorithm. There has also been work on pushing the boundaries between exact learning of Horn and of CNF [20] and on understanding the algorithm better, in particular, proving that it always outputs a canonical Horn formula of minimal size [8]. For the case in which each clause has at most k literals, known as k-CNF, there is a polynomial time algorithm even with only membership queries or with only equivalence queries [2]. Regarding the problem of exactly learning CNFs in general, it is known that they cannot be learnt with only equivalence queries [3] and that, if there exist one-way functions that cannot be inverted by polynomial sized circuits, then membership queries do not help [7].
Given the difficulty of exactly learning CNFs in polynomial time, a follow-up problem that has received considerable interest is that of learning Horn envelopes from data [14,24,12]. The Horn envelope env(ϕ) of a formula ϕ is defined as (the smallest representation of) the strongest Horn theory implied by ϕ. A number of authors [14,24,23] have studied a variation of this problem where the input is a set of models M, and the task is to find a Horn representation of the closure of M under intersection (which, by the semantic characterisation, is guaranteed to be precisely the set of models of some Horn formula). Hence, we set env(M) := env(ϕ), where ϕ is any formula such that mod(ϕ) = M.
Dechter and Pearl [14] observe that if the closure of M under intersection is only polynomially larger than M, then we can simply generate the closure (which we denote by clo(M)) and run the classic Horn algorithm on the closure, answering the queries using the data M. Kearns, Selman and Kautz [24] give a polynomial time PAC-algorithm for learning Horn envelopes. However, it remained open whether there is a deterministic, output-polynomial algorithm for learning env(M), given M. Kavvadias, Sideri and Papadimitriou showed that this problem is as hard as computing the set of all transversals of a hypergraph, a problem whose complexity has been open for more than 40 years. Finally, Borchmann, Hanika and Obiedkov [12] give a polynomial-time PAC algorithm that uses an entailment oracle for the envelope, which tells the learner whether an input clause is entailed by the envelope or not, and in the negative case provides a counterexample.
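The Dechter-Pearl observation hinges on being able to materialise the intersection closure clo(M). A minimal Python sketch (our own illustration, with models as frozensets of true variables) computes it by iterating pairwise intersections to a fixpoint:

```python
def intersection_closure(models):
    """Close a set of models (frozensets of true variables) under
    pairwise intersection, iterating to a fixpoint."""
    closure = set(models)
    frontier = set(models)
    while frontier:
        new = {a & b for a in frontier for b in closure} - closure
        closure |= new
        frontier = new
    return closure

M = [frozenset({'a', 'b'}), frozenset({'b', 'c'}), frozenset({'a', 'c'})]
clo = intersection_closure(M)
# Pairwise intersections add {'a'}, {'b'}, {'c'}; intersecting those adds set().
assert frozenset() in clo and len(clo) == 7
```

The example also shows why the caveat about the closure being "only polynomially larger" matters: in the worst case clo(M) is exponentially larger than M.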

Probing Neural Networks
Machine learned models can contain various types of biases that can stem from the training data [21]. These can lead to numerous undesired effects during deployment [11,9]. This also applies to pre-trained language models, where biases can be introduced by the datasets used during training or while tuning a downstream classifier. A lot of work has been done to explore existing biases in pre-trained language models. For example, pre-training the BERT [15] language model on a medical corpus has been shown to propagate harmful correlations between genders, ethnicity, and insurance groups [32]. Language models have also been shown to contain biases against persons with disabilities [22].
Most work on detecting gender bias in pre-trained language models has focused on probing them using template-based approaches. Such templates are usually formed of sentences combining a predefined set of predicates with verb or noun phrases. To illustrate this, consider the template "[predicate] works as [description]". The predicate here can be a pronoun or a gendered noun, while the description could be anything from nouns referring to occupations to adjectives referring to sentiment, emotions, or attributes [33,31,10,13].
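Template instantiation of this kind is easy to make concrete. The following sketch is purely illustrative (the word lists and template are our own, not those used in the cited works): it builds the cross product of predicates and descriptions that a probing study would then score with the language model.

```python
# Build probe sentences from a template; the word lists are illustrative.
predicates = ["He", "She", "The man", "The woman"]
descriptions = ["a nurse", "an engineer", "a teacher"]

def fill_template(template, predicate, description):
    return (template.replace("[predicate]", predicate)
                    .replace("[description]", description))

probes = [fill_template("[predicate] works as [description].", p, d)
          for p in predicates for d in descriptions]
assert len(probes) == 12
assert probes[0] == "He works as a nurse."
# Each probe would then be scored by the language model, e.g. by masking
# the description and comparing the model's completion probabilities.
```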
Some of the works using template-based approaches to investigate gender bias in correlation with occupations build on the Winograd Schemas [25]. Winograd is a dataset of manually annotated templates, used to assess the presence of biases in co-reference resolution systems. The biases are measured based on the dependency of the system on gendered pronouns along stereotypical and non-stereotypical gender associations with occupations. Also, the WinoBias dataset [37] has been developed to investigate existing stereotypes in models by exploring the relationship between gendered pronouns and stereotypical occupations. In addition to these, the WinoGender dataset [30] was introduced to also include gender-neutral pronouns, while focusing on the same task of exploring correlations between pronouns, persons, and occupations. For occupational biases in pre-trained language models, some works have explored the correlations between genders and occupations from a descriptive point of view using census data [36], while others have used the pre-trained language models' ability to complete templates to evaluate the extent to which these completions are biased with respect to gender and occupation [35,27].
While template-based approaches have proven to be good at probing and exploring biases in pre-trained language models, they have also been shown to be sensitive to the formulation of the templates [34]: altering the grammatical tense of a template has an effect on the resulting correlations between genders and occupations. It is therefore beneficial to explore additional ways of probing pre-trained language models, especially for tasks relying on template-based approaches.

Preliminaries
We provide relevant notions regarding propositional logic, in particular Horn logic, and the exact learning framework.

Propositional Logic
Let V be a finite set of Boolean variables. A (propositional) formula is any string of symbols generated according to the following recursive grammar:

ϕ ::= v | ⊤ | ⊥ | ¬ϕ | (ϕ ∧ ϕ) | (ϕ ∨ ϕ) | (ϕ → ϕ)

where v ∈ V, ⊤ is the truth constant and ⊥ is the falsity constant. A literal over V is either a variable v ∈ V or its negation, in symbols, ¬v. A literal is positive if it is a variable, and negative otherwise. A clause over V is a disjunction (∨) of literals over V. A formula is in conjunctive normal form if it is a conjunction of clauses, which we simply call a CNF. Every propositional formula is logically equivalent to a CNF. A clause is called a k-quasi-Horn clause if it contains at most k positive literals as disjuncts, and Horn if it has at most one positive literal as a disjunct (i.e. if k = 1). We often treat conjunctions of clauses and sets of clauses interchangeably. A Horn formula is a conjunction (or set) of Horn clauses.
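The Horn and k-quasi-Horn conditions are syntactic counts of positive literals, which a minimal sketch makes precise (the representation of clauses as sets of (variable, polarity) pairs is our own choice):

```python
# A clause as a set of (variable, polarity) literals; polarity True = positive.
def is_k_quasi_horn(clause, k):
    """A clause is k-quasi-Horn if it has at most k positive literals."""
    return sum(1 for _, positive in clause if positive) <= k

def is_horn(clause):
    return is_k_quasi_horn(clause, 1)

c1 = {("p", False), ("q", False), ("r", True)}  # ¬p ∨ ¬q ∨ r, i.e. p ∧ q → r
c2 = {("a", True), ("b", True)}                 # a ∨ b
assert is_horn(c1) and not is_horn(c2)
assert is_k_quasi_horn(c2, 2)
```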
We write clauses in implicational form: a clause ¬p_1 ∨ … ∨ ¬p_n ∨ q_1 ∨ … ∨ q_m is logically equivalent to the clause p_1 ∧ … ∧ p_n → q_1 ∨ … ∨ q_m in implicational form. A metaclause is an expression of the form P → Q, where P ⊆ V and Q is either a subset of V or ⊥. Given a clause c = P → Q in implicational form, we set ant(c) = P and con(c) = Q. Similarly, for a metaclause h = P → Q, we also set ant(h) = P and con(h) = Q. We say that a (meta)clause is negative if its consequent is ⊥, and definite otherwise.
The semantic clauses for the connectives ¬, ∧, ∨, → are as usual, and we identify a model with the set x ⊆ V of variables it makes true. We say that a model x covers a (meta)clause c if ant(c) ⊆ x. The following semantic clauses define the semantics of clauses in implicational form and of metaclauses:

x ⊨ P → q_1 ∨ … ∨ q_m iff P ⊆ x implies q_i ∈ x for some 1 ≤ i ≤ m;
x ⊨ P → Q (a metaclause) iff P ⊆ x implies Q ⊆ x.

Dually, these could also be phrased in terms of falsification. Note that a model falsifies a (meta)clause only if it covers it.
It follows that for negative (meta)clauses (which are Horn), we have: x ⊨ P → ⊥ iff P ⊈ x. Given formulas ϕ, ψ, we write mod(ϕ) := {x ⊆ V | x ⊨ ϕ}. Furthermore, we say that ϕ entails ψ (notation: ϕ ⊨ ψ) iff mod(ϕ) ⊆ mod(ψ), and that ϕ and ψ are (logically) equivalent (notation: ϕ ≡ ψ) iff ϕ ⊨ ψ and ψ ⊨ ϕ.

Lemma 1. Let h be a Horn clause and x and y models. Then x ⊭ h and y covers h implies that x ∩ y ⊭ h.

Proof. Since x ⊭ h, by the semantics ant(h) ⊆ x and con(h) ∉ x. But then ant(h) ⊆ x ∩ y, as ant(h) is included in both x and y, and con(h) ∉ x ∩ y, as con(h) is not in x.

By contrast, no analogous statement holds for k-quasi-Horn formulas for k > 1. Indeed, {a} and {b} are models satisfying the 2-quasi-Horn clause a ∨ b, yet their intersection {a} ∩ {b} = ∅ is the empty set, which does not satisfy a ∨ b. In fact, for every ∩-closed set of models M there exists a Horn representation of M (i.e. a Horn formula ϕ such that mod(ϕ) = M) that contains a minimal number of clauses. This minimal Horn representation is known as the Duquenne-Guigues basis of M [19]; we use an alternative definition from [8].
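The preservation of Horn satisfaction under intersection of models, the semantic property behind Lemma 1 and the closure results below, can be checked mechanically. The sketch below is our own illustration: it verifies, exhaustively over a handful of models, that whenever two models satisfy a definite Horn clause, so does their intersection.

```python
def satisfies_horn(x, antecedent, consequent):
    """x |= (antecedent -> consequent) for a definite Horn clause:
    if x covers the antecedent, it must contain the consequent."""
    return not antecedent <= x or consequent in x

# Horn satisfaction is preserved under intersection of models
# (checked for the clause p -> q over a few models).
ant, con = frozenset({"p"}), "q"
models = [frozenset(s) for s in
          [set(), {"p", "q"}, {"q"}, {"p", "q", "r"}, {"r"}]]
for x in models:
    for y in models:
        if satisfies_horn(x, ant, con) and satisfies_horn(y, ant, con):
            assert satisfies_horn(x & y, ant, con)
```

Running the same check with the 2-quasi-Horn clause a ∨ b and the models {a}, {b} reproduces the counterexample given above: both satisfy the clause but their intersection ∅ does not.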
Finally, we say that ϕ is saturated if it is both left- and right-saturated.

Horn Envelopes
Given an arbitrary formula ϕ, in case ϕ is not equivalent to a Horn formula (i.e. mod(ϕ) is not closed under intersection), we might still want to find a Horn formula that approximates the behaviour of ϕ. What is remarkable about Horn formulas is that there is always a unique tightest Horn approximation of ϕ, called the Horn envelope of ϕ. Consider the set of models clo(mod(ϕ)): by Proposition 2 there is a Horn formula with precisely this set as its models. We define the envelope of ϕ to be the smallest such Horn formula.

Learning via Queries
In this paper, we study the problem of exactly learning logical formulas, using queries, from data examples that are models. In the abstract setting of exact learning [2], this means that our concepts are of the form mod(ϕ) for some formula ϕ, and our examples are models. If x ⊨ ϕ, i.e. x ∈ mod(ϕ), then we say that x is a positive example for ϕ; otherwise we say x is a negative example for ϕ. Henceforth, we use 'model' and 'example' interchangeably.
We study the problem of identifying an unknown target Horn theory env(ϕ) by observing examples classified according to ϕ (where ϕ is any formula). In our setting, the learner is allowed to pose queries to two kinds of oracles: a membership oracle MQ_ϕ(·) and a Horn equivalence oracle EQ_ϕ^Horn(·). A membership query MQ_ϕ(x) takes as input an example x and returns "yes" if x ⊨ ϕ and "no" otherwise. The (Horn) equivalence query EQ_ϕ^Horn(ψ) returns "yes" if env(ϕ) ≡ env(ψ) and "no" with a counterexample x ∈ mod(ϕ) ⊕ mod(ψ) otherwise. If x ∈ mod(ψ) \ mod(ϕ) we say that x is a negative counterexample (because it is a negative example for the target ϕ) and if x ∈ mod(ϕ) \ mod(ψ) we say that x is a positive counterexample. In our setting, an exact learning algorithm with membership and equivalence queries is polynomial time if the number of computation steps is polynomially bounded by |env(ϕ)| and |V|, where each oracle query counts as one computation step.
When learning the envelope env(ϕ), negative counterexamples x returned by EQ_ϕ^Horn are only required to be negative examples for ϕ. Since mod(ϕ) ⊆ mod(env(ϕ)) = clo(mod(ϕ)), this means that the Horn equivalence oracle can return two kinds of negative examples as counterexamples to equivalence queries. We say that a negative example x for ϕ is Horn if x ⊭ env(ϕ), and non-Horn otherwise. Note that clo(mod(ϕ)) \ mod(ϕ) is precisely the set of non-Horn negative examples for ϕ.
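The Horn/non-Horn distinction can be computed directly from the closure. The following sketch is our own illustration (names are ours), using the non-Horn CNF a ∨ b as the target:

```python
def intersection_closure(models):
    closure, frontier = set(models), set(models)
    while frontier:
        new = {a & b for a in frontier for b in closure} - closure
        closure |= new
        frontier = new
    return closure

# Target: the non-Horn CNF (a ∨ b) over variables {a, b}.
phi_models = {frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})}
env_models = intersection_closure(phi_models)  # mod(env(phi)) = clo(mod(phi))

def classify_negative(x):
    """A negative example for phi is non-Horn iff it satisfies the envelope."""
    assert x not in phi_models
    return "non-Horn" if x in env_models else "Horn"

assert classify_negative(frozenset()) == "non-Horn"  # {} = {a} ∩ {b}
```

Here the only negative example, the empty model, is non-Horn: it lies in clo(mod(ϕ)) \ mod(ϕ) because it is the intersection of the two positive examples {a} and {b}.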

Learning the Horn Envelope
In this paper, we give a novel algorithm with membership and equivalence queries that is able to exactly learn Horn envelopes, where the queries are posed for the original target rather than its envelope (in contrast to the other approaches discussed in Section 2). We show that the algorithm makes exponentially many queries in the worst case, but only polynomially many if the target is a Horn formula. More precisely, we show that the number of equivalence queries is linear in |env(ϕ)|, |V| and the number of non-Horn negative examples for ϕ.
The classical algorithm for learning Horn formulas cannot simply be used to learn the envelope of an arbitrary target formula; indeed, it is not even guaranteed to terminate [28] (see also Appendix A). The basic idea of the classic algorithm is to assign a Horn clause to each negative example encountered, as a possible explanation of why this example falsified the target (w.l.o.g. we can assume the target is a CNF, which is false on an example precisely if at least one of its clauses is falsified on it).
What leads to non-termination of the classical algorithm in the non-Horn case (when the target is not a Horn formula) is that there need not be a correct "Horn explanation" of why an example is negative for the target. Indeed, we may have that an example x is negative for a CNF ϕ, yet satisfies the envelope env(ϕ). Equivalently, if x is non-Horn then x falsifies ϕ but satisfies all Horn consequences of ϕ (cf. Proposition 7). Thus the classic algorithm can receive the same non-Horn negative example over and over again, interleaved with positive counterexamples showing all Horn explanations maintained in the hypothesis to be incorrect (cf. Appendix A).

The Algorithm
In this subsection we show that Algorithm 1 terminates in exponential time when the target is a CNF, and in polynomial time in the size of the envelope of the target if the target has polynomially many non-Horn examples. A key observation in our approach is that when all Horn explanations for a negative example x have been shown incorrect, we have in fact received positive counterexamples e_1, …, e_n such that e_1 ∩ … ∩ e_n = x. In other words, we have observed data (a set of positive examples E+) that proves that x is a non-Horn negative example, because x ∈ clo(E+). Then we can use a k-quasi-Horn clause for some k > 1 to explain why x was a negative example for the target. We canonically choose the weakest non-Horn clause falsified on x as a "non-Horn explanation"; by "weakest" we mean that any other quasi-Horn clause falsified on x entails this one. It is not difficult to see that the following definition satisfies these properties:

quasi(x) := ⋀_{p ∈ x} p → ⋁_{q ∈ V \ x} q.

Hence, we overcome the difficulty of non-Horn examples by keeping track of which negative examples encountered are in the intersection-closure of the positive examples encountered so far, and excluding them with a weakest possible non-Horn clause. We assign a Horn metaclause to a negative example in the same way as the authors of [5] do. However, since we keep track of the positive examples, we make this explicit in notation (like [8]). Given a negative example x and a set of positive examples E+, let E+_x := {e ∈ E+ | x ⊆ e}, and write ⋂E+_x for the intersection of all its elements. Then we define:

horn_{E+}(x) := x → ((⋂E+_x) \ x), where the consequent is ⊥ if E+_x = ∅.

That is, we set horn_{E+}(x) to be the strongest Horn metaclause falsified on x yet still consistent with the set E+ of positive examples seen so far (recall that the positive examples of a Horn formula are closed under ∩). By strongest we mean that horn_{E+}(x) implies all other Horn metaclauses (and therefore every Horn clause) falsified on x yet consistent with E+.
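Both operators are straightforward to compute. The following is a minimal sketch under our own representation choices (models as frozensets of true variables, a consequent of `None` encoding ⊥):

```python
def quasi(x, variables):
    """quasi(x): the weakest non-Horn clause falsified exactly on x,
    i.e. x -> V \\ x, returned as (antecedent, positive disjuncts)."""
    return (frozenset(x), frozenset(variables) - frozenset(x))

def horn_meta(x, positives):
    """horn_{E+}(x): strongest Horn metaclause falsified on x yet
    consistent with the positive examples seen so far."""
    above = [e for e in positives if x <= e]
    if not above:
        return (frozenset(x), None)  # negative metaclause x -> ⊥
    return (frozenset(x), frozenset.intersection(*above) - frozenset(x))

x = frozenset({"a"})
E_plus = [frozenset({"a", "b"}), frozenset({"a", "b", "c"})]
assert quasi(x, ["a", "b", "c"]) == (frozenset({"a"}), frozenset({"b", "c"}))
assert horn_meta(x, E_plus) == (frozenset({"a"}), frozenset({"b"}))
```

In the example, both positive examples above {a} contain b, so the strongest consistent explanation of x being negative is the metaclause a → b.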
Algorithm 1: Horn Envelope Learner. Input: the oracles MQ_ϕ and EQ_ϕ^Horn. [Pseudocode omitted in this excerpt.]

Theorem 8. Algorithm 1 always terminates, after at most exponentially many steps in |V|.

Proof. We use the following two technical claims.

Claim 9. At all times, E_nh = mod(¬Q).

Proof. At any iteration, on Line 18, if x ∈ E_nh then quasi(x) ∈ Q and clearly x ⊭ quasi(x), since x ⊆ x and x ∩ (V \ x) = ∅. In fact, it is easy to see that mod(¬quasi(x)) = {x}, so E_nh = mod(¬Q).
Claim 10. At all times, E+ ⊆ mod(H) and e ⊭ horn_{E+}(e) for every e ∈ E−.

Proof. Let horn_{E+}(x) ∈ H for some x ∈ E− and let e ∈ E+. It suffices to show that e satisfies this metaclause. If x ⊈ e then clearly e satisfies it. Suppose otherwise that x ⊆ e; then e ∈ E+_x, so the consequent (⋂E+_x) \ x ⊆ e, so e satisfies it. This suffices to show that e ⊨ H and hence E+ ⊆ mod(H). Next, for any e ∈ E− we have that e ⊭ horn_{E+}(e) by definition of horn_{E+}(e).
Claim 11. Neither E − nor E nh contains any repetitions (i.e. they are sets). Moreover, they are disjoint.

Proof.
Once an example e is placed into E−, either on Line 8 or Line 10, either e will later be removed from E− and placed into E_nh or not. If it is removed, then on Line 15 Q is updated to include quasi(e). If it is not removed, then on Line 14 H is updated to include the metaclause horn_{E+}(e). In either case we have e ⊭ H ∪ Q, since e falsifies both quasi(e) and horn_{E+}(e). Moreover, this remains true for as long as e ∈ E−. Therefore e cannot be returned again as a negative counterexample to an equivalence query as long as it is in E−. But then no example can be placed twice into E−.
Similarly, as long as e ∈ E_nh we have e ⊭ Q, so e cannot be returned by the oracle as a counterexample, and hence no example can be placed twice into E_nh. Therefore E− and E_nh are (or represent) sets. To see that these sets are disjoint, observe that an example is only added to E_nh if immediately before it was removed from E−. Moreover, no example is ever removed from E_nh, so once an example is in E_nh it cannot ever be returned again by the oracle.

Termination in exponential time is not a particularly interesting result, since one could learn any CNF (using a regular equivalence query oracle) or its Horn envelope (using the Horn equivalence oracle) in exponential time simply by brute force. In the rest of this section, we focus on a more interesting aspect of our algorithm: termination in polynomial time in the size of the envelope of a non-Horn formula with polynomially many non-Horn examples. Termination in polynomial time if the target is a Horn formula follows from the fact that in this case the algorithm works as the one proposed by Angluin [5]. As already mentioned, Angluin's algorithm may not terminate if the target is non-Horn (even if it has polynomially many non-Horn examples [28], see Section A).
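The prose above pins down the overall shape of the learner. The following Python sketch is our own reconstruction from that description, not a line-for-line rendition of Algorithm 1 (the oracle interface, names, and exact order of checks are assumptions): `mq(x)` answers membership queries, and `eq(H, Q)` answers Horn equivalence queries with `None` for "yes" or with a counterexample.

```python
def learn_envelope(mq, eq, variables):
    """Sketch of the Horn-envelope learner, reconstructed from the prose.

    Metaclauses are pairs (antecedent, consequent set); a consequent of
    None encodes ⊥.  Quasi-clauses are pairs (antecedent, disjuncts)."""
    E_plus, E_minus, E_nh = [], [], []

    def horn_meta(x):
        # horn_{E+}(x): x -> (intersection of positives above x) \ x
        above = [p for p in E_plus if x <= p]
        return (x, (frozenset.intersection(*above) - x) if above else None)

    while True:
        H = [horn_meta(e) for e in E_minus]
        Q = [(e, frozenset(variables) - e) for e in E_nh]  # quasi(e)
        x = eq(H, Q)
        if x is None:
            return H, Q
        if mq(x):
            # Positive counterexample: record it, then move every negative
            # example now provably non-Horn (it lies in the intersection
            # closure of E+) from E- to E_nh.
            E_plus.append(x)
            for e in list(E_minus):
                above = [p for p in E_plus if e <= p]
                if above and frozenset.intersection(*above) == e:
                    E_minus.remove(e)
                    E_nh.append(e)
        else:
            # Negative counterexample: refine the first compatible element
            # of E- (Angluin-style), otherwise append x.
            for i, e in enumerate(E_minus):
                if e & x < e and not mq(e & x):
                    E_minus[i] = e & x
                    break
            else:
                E_minus.append(x)
```

On a Horn target such as a → b over {a, b}, this sketch behaves like the classical algorithm and stops with H containing the single metaclause {a} → {b} and Q empty.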
Lemma 12. At all times, if e_i, e_j ∈ E− with e_i ⊂ e_j, then e_i ⊆ ⋂E+_{e_i} ⊆ e_j.
Proof. By induction on the number of iterations of the main loop. The base case is trivial since initially E− is the empty list. Now suppose the claim holds at the end of iteration n−1. If the n-th equivalence query returns a positive counterexample, then E− can only have been modified by removing some example, whence the universal claim still holds for E− after the removal. So suppose the n-th equivalence query returns a negative counterexample x. Then either some e_i ∈ E− is replaced with e_i ∩ x, or x is appended to E− as the last element.
In the former case, let e_j ∈ E− with e_j ⊂ e_i ∩ x. Then e_j ⊆ ⋂E+_{e_j} ⊆ e_i by the inductive hypothesis. Moreover, since x was returned as a negative counterexample to the n-th equivalence query, it satisfies horn_{E+}(e_j), because e_j was in E− at the end of the (n−1)-th round when H was last updated. Therefore ⋂E+_{e_j} ⊆ x as well, and hence e_j ⊆ ⋂E+_{e_j} ⊆ e_i ∩ x, as desired. If x is instead appended to E−, then for all e_i ∈ E− with e_i ⊆ x we have, since x was a negative counterexample to the hypothesis created at the end of the (n−1)-th round, that x ⊨ horn_{E+}(e_i) and hence e_i ⊆ ⋂E+_{e_i} ⊆ x.
Lemma 13. For each iteration of Algorithm 1, suppose the algorithm receives a negative counterexample x from EQ_ϕ^Horn in Line 2 with x ⊭ h for some h ∈ env(ϕ), and there is some e_i ∈ E− that covers h. Then, for some e_j ∈ E− with j ≤ i, we have that e_j is replaced by e_j ∩ x in Line 9.
Proof. The proof is by induction on the iterations of Algorithm 1, following the proof strategy of Angluin [5] for learning Horn formulas. At the first iteration the lemma is vacuously true. Assume inductively that it holds for the (n−1)-th iteration. At the n-th iteration, suppose EQ_ϕ^Horn returns the negative counterexample x and we have e_i ∈ E− and h ∈ env(ϕ) such that x ⊭ h and ant(h) ⊆ e_i. If there is some e_j ∈ E− with j < i such that e_j is replaced by e_j ∩ x in Line 9, we are done. Suppose this does not happen. If we can show that e_i satisfies the conditions on Line 7, we are done, because then the least example in E− satisfying these conditions, which by assumption has index at most i, is the one replaced. By Lemma 1 it follows that e_i ∩ x ⊭ h, so e_i ∩ x is a Horn negative example and MQ_ϕ(e_i ∩ x) = no. It remains to show that e_i ∩ x ⊨ H ∪ Q, where H, Q denote the values of these sets just before the n-th equivalence query was posed. Clearly e_i ∩ x ⊨ Q, since e_i ∩ x is a Horn negative example, while only non-Horn negative examples falsify Q by Claim 9. To see that it also satisfies H, first observe that e_i ∩ x ⊨ horn_{E+}(e_j) for all e_j ⊈ e_i ∩ x. If on the other hand e_j ⊆ e_i ∩ x, then also e_j ⊆ e_i and e_j ⊆ x. Since e_j ∈ E−, we have that horn_{E+}(e_j) ∈ H. But x was a negative counterexample received as an answer to the n-th equivalence query, so x ⊨ H ∪ Q. It follows that con(horn_{E+}(e_j)) ⊆ x. Moreover, by Lemma 12 we also have that con(horn_{E+}(e_j)) ⊆ ⋂E+_{e_j} ⊆ e_i. Hence con(horn_{E+}(e_j)) ⊆ e_i ∩ x, so e_i ∩ x ⊨ H.

Lemma 14. At any iteration of the main loop, the list E− = e_1, …, e_m satisfies: (a) there are no i < j and h ∈ env(ϕ) such that e_j ⊭ h and ant(h) ⊆ e_i; (b) no two distinct elements of E− falsify the same clause of env(ϕ).

Proof. As originally proposed by Angluin [5], first we show that property (a) implies property (b). For suppose that at some iteration of the main loop there are e_i, e_j ∈ E− with i ≠ j and e_i ⊭ h and e_j ⊭ h for some h ∈ env(ϕ), violating property (b).
Without loss of generality assume that i < j. Then in particular ant(h) ⊆ e_i (since e_i ⊭ h implies that e_i covers h), violating property (a) at the same iteration.
Next, we argue by induction on the number of iterations of the main loop that property (a) holds. The base case is trivial, as initially E− is the empty list. Now suppose that property (a) holds for E−^(n−1), where E−^(n−1) denotes E− at the end of the (n−1)-th iteration, just before the n-th equivalence query is posed. If the algorithm does not halt, either a positive or a negative counterexample is returned by the oracle. If we obtain a positive counterexample, E− can only be modified by removing some example from E− and moving it to E_nh in Lines 11-13; after such a removal, E− still satisfies the universal property (a). Suppose instead that a negative counterexample x is returned. Then either x is appended as the last element of E−, or some e_i ∈ E− is replaced by e_i ∩ x.
In the former case, let x be the l-th element of E−. Then there cannot be some e_i ∈ E− with i < l and some h ∈ env(ϕ) with x ⊭ h and ant(h) ⊆ e_i, for then e_i and x would have been refined with each other by Lemma 13. Hence, when x is appended to E− as the last element, property (a) is preserved through iteration n. In the latter case, let e_i be the example that is replaced by e_i ∩ x and suppose that e_i ∩ x ⊭ h for some h ∈ env(ϕ). By the contrapositive of the preservation of Horn formulas under ∩, it follows that either x ⊭ h or e_i ⊭ h. Now property (a) could be violated in two ways: either there is some e_j ∈ E− with j < i such that ant(h) ⊆ e_j, or there is some e_j ∈ E− with i < j such that e_j ⊭ h′ for some h′ ∈ env(ϕ) with ant(h′) ⊆ e_i ∩ x.
Suppose for contradiction that there is some e_j ∈ E− with j < i such that ant(h) ⊆ e_j. If x ⊭ h, then by Lemma 13 x should have been refined with e_j instead of e_i. But it also cannot be that e_i ⊭ h, because that would violate the inductive hypothesis. Since we had established that either x ⊭ h or e_i ⊭ h, we arrive at a contradiction. The second type of violation can also not happen, for if ant(h′) ⊆ e_i ∩ x then ant(h′) ⊆ e_i as well, violating the inductive hypothesis.

With these results in hand, we can give a tighter upper bound on the number of queries posed by our algorithm.

Corollary 15. At all times, E− contains at most |env(ϕ)| Horn negative examples.

Theorem 16. Let k be the number of non-Horn negative examples for ϕ. Then Algorithm 1 terminates after making at most O((|env(ϕ)| + k)|V|) equivalence queries and at most O((|env(ϕ)| + k)²|V|) membership queries.
Proof. First we show that |E−| + |E_nh| ≤ |env(ϕ)| + k. By Claim 11 we know that E− and E_nh contain an example at most once and that they are disjoint. It follows from Corollary 15 that there are at most |env(ϕ)| Horn examples in E−. Since there are at most k non-Horn examples, and every example in E− ∪ E_nh is a negative example, hence either Horn or non-Horn, the claim follows.
Every negative counterexample received from the oracle is either appended to E− (increasing its size as a set by 1) or refines some e_i ∈ E− (removing at least one variable from it). Since in particular |E−| ≤ |env(ϕ)| + k, it follows that there can be at most (|V| + 1)(|env(ϕ)| + k) negative counterexamples returned by the oracle in total. Every positive counterexample must falsify some metaclause in H and cause at least one variable to be removed from its consequent. H always consists of metaclauses of the form horn_{E+}(e) for e ∈ E−, and hence there can be at most |E−|(|V| + 1) ≤ (|V| + 1)(|env(ϕ)| + k) positive counterexamples returned by the oracle. It follows that the algorithm terminates after making at most O((|env(ϕ)| + k)|V|) equivalence queries. The algorithm only poses membership queries in rounds where the oracle returned a negative counterexample, and in each such round at most |E−| membership queries are posed. Since |E−| ≤ |env(ϕ)| + k and at most O((|env(ϕ)| + k)|V|) negative counterexamples are returned by the oracle, it follows that there can be at most O((|env(ϕ)| + k)²|V|) membership queries in total.
Corollary 17. The algorithm terminates after making at most O(|ϕ||V|) equivalence queries and at most O(|ϕ|²|V|) membership queries when the target is Horn. These are exactly the bounds of the classical algorithm [5].
Proof. If the target is Horn then ϕ ≡ env(ϕ) and |env(ϕ)| ≤ |ϕ|, because the envelope is defined to be DG(ϕ). The claim then follows from Theorem 16 together with the observation that k = 0 if the target is Horn.

Corollary 18. If the number k of non-Horn negative examples for ϕ is polynomially bounded by |env(ϕ)| and |V|, then Algorithm 1 runs in polynomial time.

Proof. If k is polynomially bounded by |env(ϕ)| and |V|, then the claim is immediate from Theorem 16.

Theorem 19. If Algorithm 1 halts and outputs H ∪ Q then H ≡ env(ϕ).
Proof. By definition of the Horn equivalence oracle, if the last equivalence query is answered "yes" then env(H ∪ Q) ≡ env(ϕ). We show that H ≡ env(H ∪ Q) and that H is saturated; from this it follows that H is also the DG-basis of the envelope of the target ϕ. Secondly, we show that H = {horn_{E+}(e_i) | e_i ∈ E−} is saturated. We start with left-saturatedness. We clearly have e_i ⊭ horn_{E+}(e_i) for all e_i ∈ E− (whether the consequent is ⊥ or not). Let e_j ∈ E− with i ≠ j. If e_j ⊈ e_i, we have e_i ⊨ horn_{E+}(e_j) vacuously. So suppose that e_j ⊆ e_i. Then by Lemma 12 we have that e_j ⊆ ⋂E+_{e_j} ⊆ e_i. Since con(horn_{E+}(e_j)) = (⋂E+_{e_j}) \ e_j, it follows that e_i ⊨ horn_{E+}(e_j). For right-saturatedness, let e_i ∈ E−. We show that

Hardness of Learning the Horn Envelope
In this section, we establish the difficulty of learning Horn envelopes by reducing the problem of exactly learning arbitrary CNFs, which is known to be hard [4,6,17], to it (Theorem 21). This hardness result complements Corollary 18 by showing that it is, in a sense, a best possible upper bound. Frazier [18] has shown that learning arbitrary CNFs polynomially reduces to learning 2-quasi-Horn formulas. This means that we can employ a 2-quasi-Horn learning algorithm to learn a suitable encoding of a CNF as a 2-quasi-Horn formula over an extended set of variables. The trick is to replace positive literals p with the negated literals ¬p^¬, where p^¬ is a fresh variable that will be forced to be interpreted as ¬p by some extra helper formulas. We show that learning CNFs polynomially reduces to learning Horn envelopes. First, we define the encoding and establish some of its properties.
Given a set of variables V = {v_1, …, v_n}, add the fresh variables V^¬ = {v_1^¬, …, v_n^¬}. We use all clauses of the form v ∧ v^¬ → ⊥ and v ∨ v^¬ for each v ∈ V to ensure that v^¬ is interpreted as ¬v. Let χ_setup be the conjunction of all such clauses. Further, let ϕ^¬ := ϕ[¬v_1^¬, …, ¬v_n^¬ / v_1, …, v_n] be ϕ with all the positive literals substituted out. Then define:

enc(ϕ) := ϕ^¬ ∧ χ_setup.

Note that ϕ^¬ contains as many clauses as ϕ and |χ_setup| is in O(|V|), so the translation is polynomial. Each example x ⊆ V can be mapped to the example enc(x) := x ∪ {v^¬ | v ∈ V \ x} over V ∪ V^¬, for which it is easily checked that x ⊨ ϕ iff enc(x) ⊨ enc(ϕ). Moreover, there is an inverse dec(·) to enc(·) which takes a CNF over the extended set of variables V ∪ V^¬ back to a CNF over V.
That is, dec(ψ) is obtained from ψ by uniformly substituting ¬p for all subformulas p¬, and then eliminating double negations in front of atoms (CNFs are in negation normal form). It follows that dec(ϕ¬) = ϕ, dec(ψ)¬ = ψ and dec(χ_setup) ≡ ⊤, hence dec(enc(ϕ)) ≡ ϕ, i.e. dec(·) forms a retraction of enc(·). Observe that ϕ¬ consists solely of Horn clauses. Hence, the claim is that learning the Horn envelope of enc(ϕ) suffices to learn ϕ. We need the following lemmas.
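To make the translation concrete, the following minimal sketch (ours, not the paper's implementation) realizes enc(·) and dec(·) over clause lists, where a clause is a frozenset of integer literals (+v for v, −v for ¬v) and the variable v + n plays the role of v¬:

```python
def enc(phi, n):
    """enc(phi) = phi_neg + chi_setup over variables {1..2n}: phi_neg
    replaces every positive literal v by the negated literal -(v+n),
    and chi_setup forces v+n to behave as the negation of v."""
    phi_neg = [frozenset(l if l < 0 else -(l + n) for l in c) for c in phi]
    chi_setup = []
    for v in range(1, n + 1):
        chi_setup.append(frozenset({-v, -(v + n)}))  # v ∧ v¬ → ⊥
        chi_setup.append(frozenset({v, v + n}))      # v ∨ v¬
    return phi_neg + chi_setup

def dec(psi, n):
    """Substitute ¬v back for every occurrence of v¬ (variable v+n)
    and eliminate the resulting double negations."""
    def back(l):
        a = abs(l)
        return ((-(a - n) if l > 0 else a - n) if a > n else l)
    return [frozenset(back(l) for l in c) for c in psi]

# (v1 ∨ ¬v2) ∧ (¬v1 ∨ v2): dec recovers phi from the first block of enc(phi)
phi = [frozenset({1, -2}), frozenset({-1, 2})]
assert dec(enc(phi, 2)[:len(phi)], 2) == phi   # dec(phi_neg) = phi
```

Note that every clause of phi_neg contains only negative literals and hence is Horn, and that dec maps every clause of chi_setup to a tautology v ∨ ¬v, matching dec(χ_setup) ≡ ⊤.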

Lemma 20. The Horn envelope of enc(ϕ) is logically equivalent to Φ := ϕ¬ ∧ ⋀{(p ∧ p¬) → ⊥ | p ∈ V} ∧ ⋀{(ant(h) \ {p}) → p¬ | h ∈ ϕ¬, p ∈ ant(h)}.
Proof. We first show that env(enc(ϕ)) entails Φ. We know that env(enc(ϕ)) ≡ {h Horn clause | enc(ϕ) |= h}. As ϕ¬ ∪ {(p ∧ p¬) → ⊥ | p ∈ V} is a subset of enc(ϕ), these clauses are entailed by enc(ϕ). Now take any clause of Φ of the form (ant(h) \ {p}) → p¬, where h ∈ ϕ¬ and p ∈ ant(h). Recall that every Horn clause in ϕ¬ has an empty consequent because all positive literals have been substituted out. Hence, by the valid rule of resolution, resolving h = ant(h) → ⊥ with the clause p ∨ p¬ of χ_setup yields (ant(h) \ {p}) → p¬, which is therefore entailed by enc(ϕ). For the other direction, we want to show that Φ |= env(enc(ϕ)), i.e. mod(Φ) ⊆ mod(env(enc(ϕ))) = clo(mod(enc(ϕ))). So let x |= Φ. We need to show that x is in the closure of mod(enc(ϕ)). By (1), this is equivalent to checking whether x = ⋂{e ∈ mod(enc(ϕ)) | x ⊆ e}. We know that for no p ∈ V both p and p¬ are in x, and we know that x |= ϕ¬. For each p ∈ V such that neither p nor p¬ is in x, we know that x ∪ {p} still satisfies ⋀{(p ∧ p¬) → ⊥ | p ∈ V}. If x ∪ {p} ⊭ ϕ¬ then, since ϕ¬ consists only of Horn clauses of the form ant(h) → ⊥, it must be that y ⊭ ϕ¬ for all y ⊇ x ∪ {p}. So ϕ¬ |= (x ∪ {p}) → ⊥, and since ϕ¬ consists only of negative Horn clauses with ⊥ as consequent, there must be some clause y → ⊥ ∈ ϕ¬ with y ⊆ x ∪ {p}, for otherwise the example x ∪ {p} would satisfy every clause of ϕ¬ vacuously, contradicting ϕ¬ |= (x ∪ {p}) → ⊥. If y ⊆ x then x ⊭ ϕ¬, contrary to hypothesis. Hence p ∈ y and thus (y \ {p}) → p¬ ∈ Φ. But then x ⊭ Φ, as p¬ ∉ x and y \ {p} ⊆ x, again contrary to hypothesis. We conclude that x ∪ {p} |= ϕ¬.
A similar argument shows that it cannot be that x ∪ {p¬} ⊭ ϕ¬. This means that x ∪ {p} and x ∪ {p¬} both satisfy ϕ¬ as well as all clauses of the form (q ∧ q¬) → ⊥. By iterating this argument, it follows that the examples x′ := x ∪ {p ∈ V | p, p¬ ∉ x} and x″ := x ∪ {p¬ ∈ V¬ | p, p¬ ∉ x} are models of enc(ϕ). This is because at each stage we ensured that the extensions still satisfied ϕ¬ and all clauses of the form (p ∧ p¬) → ⊥, while by construction x′ and x″ also satisfy all clauses of the form p ∨ p¬. Since x′ ∩ x″ = x, we conclude that x ∈ clo(mod(enc(ϕ))). Proof (of Theorem 21). Suppose there is an algorithm A that learns Horn envelopes in polynomial time and consider any CNF ϕ over V. We will use A to learn a CNF representation of ϕ, going back and forth between the two settings with the enc(·) and dec(·) mappings. Since enc(ϕ) is just a CNF over V ∪ V¬, it suffices to show that we can correctly answer the oracle queries posed by A, using our knowledge of the encoding and our two CNF oracles MQ_ϕ and EQ_ϕ.
When A asks a membership query MQ_{enc(ϕ)}(x) for some x ⊆ V ∪ V¬, if x ≠ y¬ for every y ⊆ V then answer the membership query with "no". Otherwise, for the unique y ⊆ V such that y¬ = x, ask the membership query MQ_ϕ(y) and return the answer to A. When A instead asks a Horn equivalence query EQ^Horn_{enc(ϕ)}(ψ), ask the regular equivalence query EQ_ϕ(dec(ψ)). If the CNF oracle answers "yes" to the latter query, then dec(ψ) is a CNF representation of the original target CNF ϕ and we are done. Otherwise the oracle returns a counterexample x ∈ mod(ϕ) ⊕ mod(dec(ψ)). It follows that x¬ ∈ mod(ϕ¬) ⊕ mod(dec(ψ)¬), but mod(dec(ψ)¬) = mod(ψ) and hence x¬ ∈ mod(ϕ¬) ⊕ mod(ψ). Hence "no" with the counterexample x¬ is a valid answer to the equivalence query EQ^Horn_{enc(ϕ)}(ψ).
If x is a negative (resp. positive) counterexample, then so is x¬. Indeed, if x is a negative counterexample, i.e. x |= ¬ϕ ∧ dec(ψ), then x¬ |= ¬ϕ¬ ∧ ψ, that is, x¬ is a negative example for the Horn setting. It follows that learning Horn envelopes (in polynomial time) is as hard as learning CNFs (in polynomial time).

Learning from Neural Networks
In this section, we discuss in more detail how we address the obstacles mentioned in the Introduction when applying exact learning algorithms to extract knowledge from trained neural networks. We start with the second obstacle. Intuitively, we pose our queries to the neural network, thus viewing the neural network as the oracle. To do this, one has to define a Boolean function from a trained neural network. In this work, we create the lookup table presented in the appendix (Table 7) to convert between Boolean values and expressions in natural language given to the language model. Discrete-valued attributes such as "occupation" with 11 values (including "unknown occupation") are encoded by 10 fresh variables intuitively representing propositions such as "the occupation is mathematician". 8 In the resulting encoding, the first 5 positions of an example represent the time period, the following 9 positions the continent, and the next 10 positions the occupation. The last 2 positions represent the gender, which is the true label. An example with "after 1970", "Africa", "dancer", and "female" set then translates to the sentence "<mask> was born after 1970 in Africa and is a dancer." with the true label "female", meaning the masked token should be filled with "She".
Once such a correspondence with a Boolean function has been defined, one can use the neural network to answer oracle queries. Clearly, membership queries can easily be simulated by running the neural network on an example and checking the classification. However, the first obstacle mentioned in the Introduction is that an equivalence oracle is hard to simulate in practice because it requires checking whether two formulas are equivalent and returning a counterexample if this is not the case. In the absence of an explicit representation of the Boolean function defined by the neural network, the only (foreseeable) way of checking equivalence w.r.t. this function is to check all examples for agreement, which is an exponential task.
Hence, we use the standard technique of simulating equivalence queries by random sampling [2]. That is, every time Algorithm 1 asks an equivalence query EQ^Horn_ϕ(ψ), we randomly generate a batch of examples and check whether the hypothesis ψ classifies an example from this batch differently than env(ϕ), given the labels of ϕ for this batch. It may happen that an example x is labelled negatively yet it satisfies the envelope (if x is a non-Horn negative example for ϕ). We only know that the classification of x by ϕ is different from env(ϕ) if we have observed a number of positive examples whose intersection is x; that is, if we have the data to prove that x is non-Horn. Thus, the interpretation of the classification we receive from the oracle dynamically changes in response to the positive examples we receive. In other words, we learn what the real target env(ϕ) is (that is, where it differs from the underlying formula ϕ) whilst approximating it.
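The sampling-based simulation can be sketched as follows (a minimal illustration; `hypothesis` and `oracle_label` are hypothetical callables standing in for the current Horn hypothesis and the network's classification):

```python
import random

def simulated_eq(hypothesis, oracle_label, n_vars, batch_size=100, seed=None):
    """Approximate equivalence query: draw a batch of random examples and
    return the first one on which hypothesis and oracle disagree, or None
    if the whole batch agrees (hypothesis accepted with high confidence)."""
    rng = random.Random(seed)
    for _ in range(batch_size):
        # an example is the set of variables assigned true
        x = frozenset(v for v in range(n_vars) if rng.random() < 0.5)
        if hypothesis(x) != oracle_label(x):
            return x  # treated as the counterexample of a negative reply
    return None
```

As discussed above, a negatively labelled counterexample returned this way may still satisfy env(ϕ) if it is non-Horn; the algorithm revises its interpretation of such labels as more positive examples accumulate.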
Regarding the third obstacle, it is unlikely that a Boolean function defined by a neural network (suitably binarized) defines exactly a Horn formula. This is because neural networks tend not to be rule-like, while a Horn formula is exactly a set of rules. This motivates studying the problem of learning Horn envelopes of arbitrary Boolean functions. However, we can use neither the deterministic algorithm by Dechter and Pearl [14] nor the probabilistic one by Kearns, Selman, and Kautz [24], because both assume access to a complete description of the positive examples (or their so-called "characteristic models" [24], whose intersection closure is the set of all positive examples), which is an unrealistic assumption.
While Corollary 18 may seem like a weak statement, it is of great practical interest. This is because real-world data tends to be sparse [12]. That is, |V| tends to be big while |ϕ| tends to be small (so that identifying the rules is like looking for a needle in a haystack), from which it follows that the number of non-Horn examples |clo(mod(ϕ)) \ mod(ϕ)| tends to be small as well, as it depends only on |ϕ|. In other words, the number of non-Horn examples tends to be exponentially smaller than the total number of examples 2^|V| [12]. As |V| becomes larger and larger, it becomes more and more likely that the number of non-Horn examples is only polynomial in |env(ϕ)| and |V|. Similar remarks are made in footnote 4 of [24].
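To illustrate, the following sketch (illustrative only, for tiny formulas) enumerates mod(ϕ), closes it under intersection, and counts the non-Horn examples clo(mod(ϕ)) \ mod(ϕ):

```python
from itertools import combinations

def models(phi, n):
    """All models of a CNF (clauses = frozensets of ±int literals over
    variables 1..n), each model given as the set of true variables."""
    out = []
    for bits in range(2 ** n):
        x = frozenset(v for v in range(1, n + 1) if bits >> (v - 1) & 1)
        if all(any((l > 0) == (abs(l) in x) for l in c) for c in phi):
            out.append(x)
    return out

def closure(ms):
    """Least intersection-closed superset of a set of models."""
    clo = set(ms)
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(clo), 2):
            if a & b not in clo:
                clo.add(a & b)
                changed = True
    return clo

phi = [frozenset({1, 2})]            # v1 ∨ v2, a non-Horn CNF
non_horn = closure(models(phi, 2)) - set(models(phi, 2))
assert non_horn == {frozenset()}     # exactly one non-Horn example
```

On sparse real-world formulas this difference stays exponentially smaller than 2^|V|, which is exactly the regime where the envelope is a good approximation.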
We quote the following passage from [14] (adapted to fit our notation, where M is a set of models): "If clo(M) is substantially larger than M, we know that any Horn approximation is bound to be very poor. It is only when |clo(M) \ M| is a fraction of |M| that Horn formulas can offer a reasonable approximation to M, and it is precisely in those cases that we can find a tightest Horn approximation in reasonable time. This suggests a strategy of focusing the development of Horn approximations on only those cases that can benefit from such approximations." In the next section we present our experimental results using a modified version of Angluin's algorithm for learning Horn formulas, where queries are converted into natural language and posed to language models.

Experiments
We describe an experiment performed on different language models (LMs) used as oracles. In more detail, we employ a modified Horn algorithm to extract rules from the BERT-based language models BERT-base and BERT-large [15] as well as RoBERTa-base and RoBERTa-large [26]. All models are used with their implementations on huggingface and accessed via its API. 9 Our goal with this experiment was to showcase the applicability of the Horn algorithm for probing LMs and to find out whether occupation is generally more often linked to gender 10 than to other attributes. For this comparison we use nationality and birth year as they are, next to gender, defining attributes of every person. They represent culture and age in a very simple form and are therefore likely to also be linked to a person's occupation, as certain age groups or cultures are more likely to have one occupation over another. As a sanity check, we also perform a simple probing with the same language models and setup.
We extract a dataset from wikidata 11 that consists of every entity with a given occupation and their birth year, nationality, and gender. The occupations used are the 60 occupations shown to be most gender-biased for the BERT-base model [16]. The nationality is represented at the level of continents, as this drastically reduces the number of possible values. For the same reason, the birth years are summarized into 5 time periods instead of distinct years. The exact time periods are determined from the dataset: the period boundaries are evenly spaced over the birth years of all the entities in the dataset. This gives a more fine-grained distinction within the 1900s, whereas everything before 1875 is summarized into one period. The exact values can be seen in the first half of the lookup table (Table 7). Each entity from the dataset then yields one example for probing by filling its attributes into the template sentence "<mask> was born [year] in [continent] and is a [occupation]." The probing is done by predicting the masked pronoun in each sentence i with the given language model. The difference between the resulting probabilities for "he" and "she" is then used as the gender bias on example i [16]: the Pronoun Prediction Bias Score PPBS_i = Prob_i("he") − Prob_i("she").
With N_occ being the number of examples for occupation occ, the Pronoun Prediction Bias Score for occupation occ is then the average over all examples with this occupation: PPBS_occ = (1/N_occ) Σ_{i : occ_i = occ} PPBS_i. Based on the results of the probing (Figure 2) and the frequency of the occupations in the data (Figure 1), we chose 10 clearly biased occupations for the Horn algorithm.
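The two scores can be computed as in the sketch below; the fill-mask call is indicated only in comments, and the function names are ours, not the paper's:

```python
# In practice the probabilities would come from a fill-mask pipeline, e.g.:
#   from transformers import pipeline
#   unmasker = pipeline("fill-mask", model="bert-base-uncased")
# Here we only show the aggregation over precomputed probabilities.

def ppbs(prob_he, prob_she):
    """PPBS_i = Prob_i("he") - Prob_i("she") for one template sentence."""
    return prob_he - prob_she

def occupation_bias(scores):
    """PPBS_occ: average of PPBS_i over the N_occ examples of each
    occupation; scores is a list of (occupation, PPBS_i) pairs."""
    by_occ = {}
    for occ, s in scores:
        by_occ.setdefault(occ, []).append(s)
    return {occ: sum(v) / len(v) for occ, v in by_occ.items()}
```

A positive PPBS_occ indicates a bias of the model towards "he" for that occupation, a negative one a bias towards "she".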
In our experiment, we developed a function that creates a sentence out of the attributes encoded in the variables of each interpretation. In the context of the task, an interpretation corresponds to an entity with certain attributes. Each attribute is one-hot encoded into a vector with at most one 1, and all of the attribute vectors together form one interpretation (= entity). In particular, the 4 attributes are: period of time (5 features), nationality (as a continent, 9 features), occupation (10 features), and gender (2 features). The attributes are handled in the same way as in the probing experiment. With this, each interpretation can be translated into natural language attributes using a lookup table (Table 7), which can then be filled into the template sentence.
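Such a translation function could look as follows; the attribute value lists are placeholders for the actual lookup table (Table 7), and the position of the "female" bit within the gender block is our assumption:

```python
def one_hot_value(bits, values, default):
    """Decode a one-hot attribute block (at most one 1) into its value."""
    hot = [i for i, b in enumerate(bits) if b]
    return values[hot[0]] if hot else default

def to_sentence(x, periods, continents, occupations):
    """x: 0/1 list of length 26 -- 5 period bits, 9 continent bits,
    10 occupation bits, 2 gender bits (the label). Returns the masked
    template sentence and the pronoun encoding the true label."""
    period = one_hot_value(x[0:5], periods, "at an unknown time")
    continent = one_hot_value(x[5:14], continents, "an unknown place")
    occupation = one_hot_value(x[14:24], occupations, "unknown occupation")
    # assumption: first gender bit = female, second = male
    pronoun = "She" if x[24] else ("He" if x[25] else None)
    sentence = f"<mask> was born {period} in {continent} and is a {occupation}."
    return sentence, pronoun
```

For the example from the lookup-table discussion, a vector with "after 1970", "Africa", "dancer", and "female" set yields the sentence "<mask> was born after 1970 in Africa and is a dancer." with label "She".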
For a membership query, the language model predicts the gender of the given entity by predicting the masked pronoun in the sentence. We compare this prediction with the given gender and return whether they match as the result of the query. We generate the samples for an equivalence query as random feature vectors, subject to each attribute having at most one 1 in it. The number of equivalence queries simulated by the Horn algorithm was limited to 50, 100, and 200 for different experiments. For each language model we conducted 10 iterations of each experiment.
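Random examples respecting the at-most-one-1 constraint can be generated per attribute block, as in this sketch (block sizes as in our encoding; allowing a block to be all-zero corresponds to an "unknown" attribute value):

```python
import random

def random_example(block_sizes=(5, 9, 10, 2), rng=random):
    """Random feature vector in which every attribute block has at most
    one 1; drawing index == size leaves the block all zeros."""
    x = []
    for size in block_sizes:
        hot = rng.randrange(size + 1)  # size -> attribute left unset
        x.extend(1 if i == hot else 0 for i in range(size))
    return x
```

A membership query then amounts to building the template sentence for such a vector, letting the model fill the mask, and comparing the predicted pronoun with the gender bits.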
The results from the extractions of each language model expose biases in all of them (Tables 3, 4, 5 and 6). We consider the rules that were extracted in at least 7 iterations as the most relevant and reliable ones. With 100 equivalence queries, the relevant rules for each language model link gender and occupation without taking other attributes into account (with one exception). The other attributes appear almost exclusively in less relevant rules that were extracted in 3 or fewer iterations. The only exception is the rule singer ∧ male → before 1875, extracted by RoBERTa-large with 100 equivalence queries in 7 out of 10 iterations. The same rules appear in 10 out of 10 iterations with 200 equivalence queries for all models. This confirms that those rules are the most relevant ones. It also shows that the maximum number of equivalence queries matters for the kind of rules that are extracted and for how reliable they are.
Recall that a rule with ⊥ in the consequent means that the antecedent does not occur. In addition, we consider gender to be exclusively binary 12 and therefore it also holds that ¬female ↔ male. In other words, we extracted rules revealing certain stereotypes, e.g. stating that "women are not football players" and "nurses are women". All extracted stereotypes of this kind match the results from the probing experiment. It is also important to note that, out of all rules extracted, the base models (RoBERTa-base and BERT-base) only relate the male gender with "nurse", without relating the other female-perceived occupations "fashion designer", "dancer", and "singer" as well. On the other hand, the female gender is related to all male-perceived occupations, even those that are less strongly biased. This shows that bias involving females is more present than bias involving males; the latter is only extracted in the strongest case, "nurse". This experiment took approximately 1, 3, and 13 hours per iteration with 50, 100, and 200 equivalence queries respectively for the base models on a PowerEdge R7525 server; for the large models, one iteration took approximately 2, 5, and 15 hours respectively (Table 2). Although we saw an improvement in the quality of the rules extracted with 200 equivalence queries, the runtime is also significantly higher. In our experiments, 100 equivalence queries were sufficient to extract the same rules in 70% of the runs. There is a trade-off between the runtime and the quality of the extracted rules that favors multiple rounds of the Horn algorithm with 100 equivalence queries over fewer rounds with 200 equivalence queries.

Conclusion
We presented an approach for extracting knowledge from language models based on Angluin's exact learning model with queries and counterexamples. We adapted Angluin's classical algorithm for exactly learning Horn theories to make it applicable to learning from neural networks. In particular, we observed that trained neural networks may not behave as Horn oracles, meaning that their underlying target theory may not be Horn. We proposed a new algorithm that aims at extracting the envelope of the target theory and that is guaranteed to terminate in exponential time (in the worst case) and in polynomial time if the target has polynomially many non-Horn examples. We also proved that exactly learning Horn envelopes in polynomial time is at least as hard as learning CNFs, and therefore not expected to be possible in polynomial time. We performed experiments on pre-trained language models and extracted rules exposing occupation-based gender biases in these models, such as rules expressing that women are not mathematicians, diplomats, or bankers, and that men are not nurses. While these results are not surprising given the results of several authors when probing language models (see Subsection 2.2) and existing gender biases in society [1], our approach provides a way of exploring other potential correlations, such as those related to time periods and locations.