A Novel Boolean Kernels Family for Categorical Data †

Kernel based classifiers, such as SVM, are considered state-of-the-art algorithms and are widely used on many classification tasks. However, this kind of methods are hardly interpretable and for this reason they are often considered as black-box models. In this paper, we propose a new family of Boolean kernels for categorical data where features correspond to propositional formulas applied to the input variables. The idea is to create human-readable features to ease the extraction of interpretation rules directly from the embedding space. Experiments on artificial and benchmark datasets show the effectiveness of the proposed family of kernels with respect to established ones, such as RBF, in terms of classification accuracy.


Introduction
Large-margin kernel machines (e.g., SVM) are recognized state-of-the-art algorithms in machine learning applications. They are broadly applied to several domains, such as text categorization, spam filtering, RNA function prediction, and so on. However, since these methods typically work on an implicitly defined feature space by resorting to the well-known kernel trick, the interpretability of the resulting model is difficult.
This last aspect is often crucial in specific application areas, such as the medical ones, in which the simple predictive answer is not enough. Being this a requirement for the acceptance of these black-box models by end users, in the last decade, several methods have been introduced for rule extraction from SVMs (see [1] for a recent survey). The majority of the proposed approaches try to extract if-then rules over the input variables and this task is generally hard when common kernels, e.g., the polynomial and RBF ones, are used.
On the other hand, Decision Trees (DT), thanks to their easy logical interpretation, are very appreciated, especially by non-expert users. The shortcoming of DTs is that, in general, they are not as accurate as more complex methods. In the case of binary valued input data, an alternative approach to make SVM more interpretable consists in defining features that are easy to interpret, for example, features that are propositional (i.e., logical) formulas applied to the input vectors. In particular, Boolean kernels are kernel functions in which the binary input vectors are mapped into an embedding space formed by propositional formulas of the input variables, and, in such space, the dot product is performed.
More formally, a Boolean kernel function κ : {0, 1} n × {0, 1} n → N is defined as κ(x, z) = φ(x), φ(z) where φ : {0, 1} n → {0, 1} N is the embedding Boolean function with x, z ∈ {0, 1} n . When the input space is binary, the linear kernel can be seen as a limit case of a Boolean kernel in which the features simply represent the Boolean literals themselves.
In this paper, we propose a new family of Boolean kernels able to produce feature spaces composed by arbitrarily complex propositional formulas. In particular, we first introduce the basic Boolean kernels [2], namely the conjunctive kernel and the disjunctive kernel, for both the monotone and non-monotone case. On top of these kernels, we then propose more complex kernels such as the Disjunctive Normal Form kernel and the Conjunctive Normal Form kernel. For all the proposed kernels, an efficient method to compute them is provided. We assess the quality of the proposed kernels in terms of classification accuracy on several categorical datasets, and compare their performance against state-of-the-art kernels, such as the RBF kernel, and other Boolean kernels proposed in the literature.
The reminder of this paper is structured as follows: in Section 2, we give an overview of the existing work related to Boolean kernels. In Section 3, we present the proposed Boolean kernels family and, in Section 4, we discuss their computational complexity. Finally, Section 5 shows all the performed experiments on several benchmark categorical datasets.

Related Work
Sadohara [3] was the first to introduce the idea of Boolean kernel. In that work, the concept of Boolean kernel is actually related to a single kernel called DNF kernel. Specifically, he proposed a SVM for learning Boolean functions: since every Boolean (i.e., logical) function can be expressed in terms of Disjunctive Normal Form (DNF) formulas, the proposed kernel creates a feature space containing all possible conjunctions of negated or non-negated Boolean variables.
For instance, the feature space for a two variables, e.g., x 1 , x 2 , DNF contains the following 3 2 − 1 features: The resulting decision function of a kernel machine which employs the DNF kernel can be represented as a weighted linear sum of conjunctions (Representer Theorem [4,5]), which in turn can be seen as a kind of "soft" DNF.
Formally, the DNF kernel between x, z ∈ R n is defined as while its monotone (i.e., without negations) form is the following By restricting the domain of the vectors in {0, 1} n , the computation of the kernels is simplified as follows where x = 1 n − x is the vector with all the binary entries swapped. Note that the sum x, z + x, z simply counts the number of common "bits" between x and z. These last two kernels were independently discovered in [6,7]. A drawback of this type of kernels is the exponential growth of the size of the feature space with respect to the number of involved variables, i.e., 3 n − 1 for n variables. To give the possibility of controlling the size of the feature space, Sadohara et al. [8] proposed a variation of the DNF kernel in which only conjunctions with up to d variables (i.e., d-ary conjunctions) are considered. Over binary vectors, this kernel, dubbed d-DNF kernel, is defined as and trivially, if d = n, κ d dnf (x, z) = κ dnf (x, z). A nice property of the d-DNF kernel is that it yields a nested sequence of hypothesis spaces, i.e., H 1 ⊆ H 2 ⊆ · · · ⊆ H n . Thus, choosing a degree d (also known as "arity") for the kernel implicitly means controlling the capacity of the hypothesis space, which is a very important aspect in learning. The same "trick" can be applied to the monotone d-DNF kernel [9]: Instead of limiting the number of involved variables, Zhang et al. [10] proposed a parametric version of the DNF and mDNF kernel. Specifically, given x, z ∈ {0, 1} n and σ > 0, then The parameter σ induces an inductive bias towards simpler or more complex DNF formulas. Specifically, for σ in the range [0, 1] a bias towards shorter DNF is given, while for σ > 1 the bias is more towards longer DNF. When σ = 1, then κ (σ) dnf (x, z) = κ dnf (x, z), and the same for the monotone DNF kernel. In the following, we refer to these kernel as σ-DNF and σ-mDNF kernel, respectively. Zhang et al. also proved that, for binary input vectors, the polynomial kernel, i.e., κ p POLY (x, z) = (σ x, z + c) p , c ∈ R + , is a Boolean kernel, even though they did not provide any formal definition of Boolean kernel. Nonetheless, an important observation is that the embedding space of a polynomial kernel is composed by all the monomials (that are conjunctions) up to the degree p. Thus, the only difference between the polynomial one and the d-DNF kernel are the weights associated to the features. It is also noteworthy that, in the binary case, the embedding of the polynomial kernel contains sets of equivalent features, e.g., for p = 4 and x ∈ {0, 1} 2 , the value of the features (in the feature space) A kernel related to the polynomial is the all-subset kernel [11,12], defined as which considers a space with a feature for each subset of the input variables, including the empty subset. It is different from the polynomial because it does not limit the number of considered monomials, and it gives the same weight to all the features. It is easy to see that the all-subset kernel and the monotone DNF kernel are actually the same kernel up to the constant −1, i.e., κ ⊆ (x, z) = κ mdnf (x, z) + 1. Both the polynomial and the all-subsets kernel have limited control of which features they use and how they are weighted. The polynomial kernel uses only monomials of degree up to p with a weighting scheme depending on a parameter (c). The all-subsets, instead, makes use of the monomials corresponding to all possible subsets of the n input variables.
A restricted version of the all-subset kernel is the ANOVA kernel [11] in which the embedding space is formed by monomials with a fixed degree d without repetition. For example, given x ∈ {0, 1} 3 the feature space of the all-subset kernel would be made by the features x 1 , x 2 , x 3 , x 1 x 2 , x 1 x 3 , x 2 x 3 , x 1 x 2 x 3 and ∅, while for the ANOVA kernel of degree 2 it would be composed by x 1 x 2 , x 1 x 3 and x 2 x 3 . Formally, the ANOVA kernel is defined as follows where i 1 , i 2 , . . . , i d are all the possible sets of indices of cardinality d, taken from the set {1, . . . , n}.
In [13,14], Boolean kernels are used for studying the learnability of logical formulas, specifically DNF formulas, using maximum margin algorithms, such as SVM. In particular, in [14], the authors showed the learning limitations of some Boolean kernels inside the PAC (Probably Approximately Correct) framework. Moreover, in [13], Kowalczyk at al. proposed a generalization of the Sadohara mDNF kernel. A special case of this kernel is represented by the σ-mDNF kernel.
From the application point of view, Boolean kernels have been successfully applied on many learning tasks, such as face recognition [15,16], spam filtering [17], load forecasting [18], and generic binary classification tasks [8,10].

A New Family of Boolean Kernels
In this section, we propose a new family of Boolean kernels which owns the characteristic of creating feature spaces that are very easy to understand, since they are based on logic. Specifically, features are logical formulas (of a fixed form) over the input Boolean variables.
Firstly, we present the most basic Boolean kernel and then, for both the monotone and the non-monotone cases, we propose kernels which mimic the conjunctive operator (and) and the disjunctive operator (or). Then, we provide an efficient way to combine these "base" Boolean kernels to obtain more complex ones, such as kernels with feature spaces composed by normal form formulas.
Throughout the paper, unless specified otherwise, we refer to vectors x, z ∈ {0, 1} n as binary (Boolean) vectors of dimension n ∈ N + . We also use X ≡ {i | x i = 1} and Z ≡ {i | z i = 1} as the sets interpretation of those vectors, while the set U ≡ {1, . . . , n} indicates the universal set. Given a set A, we refer to its i-th element with A i for some enumeration of the elements of A. With the notation [A] k ≡ {S ⊆ A | |S| = k}, we express the set of all the subsets of A of cardinality k. It is worth noticing that, for any binary vector x, x 2 2 = x 1 holds, which is the number of ones contained in it. For the sake of brevity, we refer to this quantity with |x|. Moreover, 1 n denotes the n-dimensional vector with all entries equal to 1, and with the notation ·, · we refer to the dot product. The symbol denotes the entry-wise multiplication between matrices. Finally, given a function f : X → Y n , Y ⊆ R, then | f | denotes the dimension of its codomain, i.e., n.
For each of the proposed kernels, the embedding function is provided in the general form is a truth value associated with the application of the formula b to x. For example, let b(x) = x 1 ∧ x 2 and v = [0, 1, 0, 1], then b(v) = 0, that is false. For the sake of simplicity, for each new kernel, only the set B from which the Boolean formulas are taken is defined.

Monotone Boolean Kernels
A Boolean function (or formula) f : {0, 1} n → {0, 1} is called monotone if replacing a 0 (i.e., false) with a 1 (i.e., true) in the input can only increase f 's value, i.e., the truth value can only change from false to true. In other words, a formula f is monotone if it does not have any not operator.

Monotone Literal Kernel
In logic, a literal is an atomic formula or its negation. Here, we are in the monotone setting, so we refer to a literal only in its positive form. In this case, the embedding is formed by Boolean literals taken from the set Hence, the monotone Literal (mL) kernel, κ mL (x, z), will count how many true (i.e., positive) input literals the vectors have in common. Actually, κ mL is exactly the linear kernel κ LIN (x, z) = x, z , which simply performs the dot product between the two input binary vectors.

Monotone Conjunctive Kernel
In Boolean algebra, given two Boolean variables x, z ∈ {0, 1}, the conjunction (i.e., and) between x and z, denoted by x ∧ z, is satisfied if and only if x = z = 1, that is if and only if both variables are true. Given two vectors x, z ∈ {0, 1} n , the monotone Conjunctive (mC) kernel [19] κ c mC (x, z) counts how many monotone conjunctions of the input variables, of a fixed arity c, are satisfied in both x and z. In particular, the embedding is defined by Boolean formulas taken from which represent all the possible conjunctions of c literals (i.e., variables) taken from x. The dimension of the resulting feature space is ( n c ), that is the number of all the combinations of c different variables taken from the input n-dimensional space. To count all the possible conjunctions of c variables satisfied in both x and z, we need to calculate the number of combinations of c monomials that can be formed by using all the positive variables in both the vectors, that is the value of the kernel κ mL (x, z). Hence, we obtain: It is easy to see that for binary vectors κ c mC is actually the ANOVA kernel of degree c [11]. As shown in [19], we can express the Sadohara mDNF Kernel [3] as a linear combination of mC-kernels of arity 1 ≤ d ≤ n as in the following: A similar construction also holds for the d-mDNF kernel.

Monotone Disjunctive Kernel
The disjunction of two Boolean variables x, z ∈ {0, 1}, denoted by x ∨ z, is not satisfied if and only if x = z = 0, that is if and only if both variables are false, or, in other words, it is satisfied anytime at least one of the variables is true. The monotone Disjunctive (mD) kernel [19], κ d mD (x, z) counts how many monotone disjunctions, of a fixed arity d, are satisfied in both x and z. The embedding of this kernel is defined by Boolean formulas taken from B To explain how to count the common positive disjunctions in both x and z, we can rely on the analogy between binary vectors and sets. An active disjunction of d literals for X can be defined as a set of d elements taken from the universe U , let us call it U d , such that ∃ a ∈ U d | a ∈ X . Anytime ∃ a, b ∈ U d | a ∈ X ∧ b ∈ Z (potentially a = b), then U d is an active subset for X and Z. Using this interpretation, we can say that the value of the kernel is the number of active subsets U d in common between X and Z. We can count the number of these subsets in a negative fashion. Starting from the number of all possible subsets, which is ( |U | d ), we remove the inactive subsets for X and for Z. An inactive subset for X is a set such that it does not contain any element of X , and the number of this kind of sets is ( |U X | d ). Analogously, we can do the same for Z. Now, we have removed twice the subsets formed by elements taken from X ∪ Z ≡ U (X ∪ Z ) and hence we need to add its contribution once, that is ( |U (X ∪Z )| d ). We can now define κ d mD as

Non-Monotone Boolean Kernels
Converse to the monotone case, non-monotone Boolean formulas can contain negated variables, e.g., ¬x i , thus the mL-kernel is not expressive enough to be the simplest non-monotone Boolean kernel because it does not consider negated variables.

Non-Monotone Literal Kernel
To include the contribution of the negated variables, we need to add the number of false variables in common between x and z to the mL-kernel. This can be calculated by the negation kernel, defined as in which the embedding is defined by Boolean functions taken from B ≡ { f i | f i (x) = ¬x i }. Hence, the non-monotone Literal (L) kernel, κ L (x, z), counts how many true and false variables x and z have in common, and it is defined as a sum of kernels [11] as in the following:

Non-Monotone Conjunctive Kernel
The non-monotone Conjunctive (C) kernel counts how many non-monotone conjunctions of a certain arity c are satisfied in both x and z.
Since we are considering conjunctions of variables, this kernel will count how many combinations of (possibly negated) common variables there are between x and z. Thus, relying on these considerations and on the definition of κ L , we can finally define the κ c C (x, z) as follows: Similar to the monotone case, we can express the Sadohara DNF Kernel [3] as a linear combination of C-kernels of arity 1 ≤ d ≤ n as in the following: An analogous construction also holds for the d-DNF kernel.

Non-Monotone Disjunctive Kernel
The non-monotone Disjunctive (D) kernel counts how many non-monotone disjunctions, of a certain arity d, are satisfied in both x and z. The embedding is defined by Boolean formulas in the set B of all the non-monotone disjunctions of d literals. Formally, and g defined as in the previous section. As for the monotone case, we derive the kernel function in a negative fashion. The number of every possible combination of arity d of variables that can be also in their negated form is 2 d ( n d ). For both x and z, we have to discard the combinations that are false, which are exactly ( n d ) because for each set of d different variables, there exists only one assignment of the negations such that the disjunction is false. For example, given the variables x 1 = 1, x 2 = 0 and x 3 = 1, only the disjunction ¬x 1 ∨ x 2 ∨ ¬x 3 is false and all the others 2 3 − 1 negation assignments are true. Then, we have to re-add the false combinations that have been discarded twice, that is the combinations made with variables that are false in both the vectors. This can be seen as the opposite of what C-kernel computes, but, since we generate all the possible combinations with all the possible negation assignments, the counting is actually the same as the C-kernel. Finally, we can define the

Boolean Kernels Combination
Given the Boolean kernels defined in the previous sections, we have now all the basic elements to build new Boolean kernels that represent a specific Boolean concept. Table 1 shows a summary of all the presented Boolean kernels. It is easy to see that all the kernels are function of dot products of the input vectors, and this allows us to create new kernels by replacing those dot products with other Boolean kernels. The logical interpretation of the new kernel depends on how the base kernels are combined. In the following, we present some new Boolean kernels generated by using the above mentioned method. Table 1. Summary of the just presented Boolean kernels: x, z ∈ {0, 1} n and |φ| stands for the dimension of the feature space.

DNF Kernels
A disjunctive normal form (DNF) is a normalization of a logical formula which is a disjunction of conjunctive clauses. In the monotone case, both conjunctive and disjunctive clauses are monotone, i.e., the literals are only in their positive form. Since DNFs are disjunctions of conjunctive clauses, we can combine the embedding maps of the mC-kernel and the mD-kernel in this way φ d,c mDNF : , obtaining the desired feature space for the monotone DNF, which leads to the definition of the mDNF-kernel as in the following: Note that we need to know the dimension of the feature space of the mC-kernel. The features of this kernel are actually monotone DNF formulas with exactly d disjunctions of conjunctive clauses of arity c. In the non-monotone case, we proceed in a similar way but, since the conjunctive clauses are non-monotone, we have to use the C-kernel instead of the mC-kernel. So, the feature space can be created by composing the embedding map of the mD-kernel with the one of the C-kernel, that is, ). Consequently, the DNF-kernel is defined as follows:

CNF Kernels
A logic formula is in conjunctive normal form (CNF) if it is composed of conjunctions of disjunctive clauses. Clearly, in its monotone form it does not contain any negated literal. By using a similar approach as for the mDNF-kernel, the feature space is defined by the function φ d,c mCNF : x → φ d mC (φ c mD (x)), and hence in the kernel function we replace the dot products inside the mC-kernel with the mD-kernel as in the following: The resulting feature space is composed of monotone CNF formulas with exactly c conjunctions of disjunctive clauses of arity d. By swapping the mD-kernel with the D-kernel, we can easily obtain the non-monotone CNF kernel: which has associated the embedding function φ d,c CNF : ). For the sake of brevity, in the rest of the paper, we indicate the mDNF-kernel having d disjunctions and c conjunctions with either the notation mDNF(d,c)-kernel or simply mDNF(d,c).

Computational Complexity
The computational complexity of the Boolean kernels described in the previous sections is bounded by the complexity of the calculation of the binomial coefficient, that is O(k) with k the arity of the combinations. Hence, computing an entire kernel matrix over n-dimensional examples taken from a dataset with l examples would lead to a complexity of O((k + n) · l 2 ). Even though it is not possible to reduce such complexity, we can take advantage of the recursive nature of the binomial coefficient to compute the kernels in a recursive fashion. By doing so, it is possible to compute higher degree kernel matrices by leveraging on kernel matrices of lower degrees.
Let the matrix K 0 be the base (Boolean) kernel matrix over the dataset D, such that , for x i , x j ∈ D and some φ with codomain {0, 1} n . Then, we can recursively define the Boolean kernels for both monotone and the non-monotone case as described in the following.

mC-Kernel
By definition, the mC-kernel matrix of arity 1, that is K 1 mC , is equivalent to the base kernel matrix K 0 . By using K 0 as base case, we can recursively define the mC-kernel matrix as

mD-Kernel
Let us define the matrices Then, we define, recursively in its parts, the mD-kernel matrix K d mD as where with the corresponding base cases N (1) = n1 l 1 l , N (1) x = N x and N (1) xz = N xz .

C-Kernel
By relying on the previous definition of S and the base case kernel matrix the C-kernel matrix can be recursively defined by

D-Kernel
Using the previous definitions of the matrices N (d) and K c C , we can define the D-kernel matrix, recursively in its parts, as Both the standard and the recursive definition have been implemented in a freely available Python module pyros available at the following URL: https://github.com/makgyver/pyros.

Evaluation Protocol
In all experiments, the kernels have been normalized using the well-known formulã .
We assessed the proposed Boolean kernels by using SVM as kernel machine and we compared them with the linear kernel, the RBF kernel, the (monotone) DNF kernel proposed by Sadohara et al. [3], the d-DNF kernel [9], the σ-mDNF kernel [10] and the Tanimoto kernel.
Both validation and test were evaluated in terms of the Area Under the ROC Curve (AUC). The used validation method is a five-fold nested cross validation. For each dataset, the test was repeated 10 times and the average performances were recorded. Specifically, we validated the misclassification cost parameter C for the SVM in the set {2 −4 , 2 −3 , . . . , 2 4 }; for each of the proposed kernels we validated both the conjunctive arity c (when applicable) and the disjunctive arity d (when applicable) in the set {1, . . . , 5}, for the RBF kernel we validated the hyper-paramater γ ∈ {10 −4 , . . . , 10 4 }, while for the d-mDNF we fixed d = 5. Finally, for the σ-mDNF kernel we validated σ in the set {0.2, 0.5, 1, 2}. All the experiments were implemented in Python using the machine learning module Scikit-learn [20]. The source code is freely available at https://github.com/makgyver/pyros.
The benchmark datasets used for the experiments are reported in Table 2. These datasets are freely available from the UCI repository [21] and the KEEL repository [22]. We selected datasets with binary or categorical features and for each of them the following preprocessing steps were performed: • Instances with missing attributes were removed. • categorical features, including the binary ones, were mapped into binary features by means of the one-hot encoding [23]. This preprocessing keeps for every example in the dataset the same number of ones , or in other words every input vector has the same euclidean norm. • Non-binary tasks were artificially transformed into binary ones, by arranging the classes into two groups while trying to keep the number of instances balanced.

Experimental Results
The first set of experiments assessed the quality of the proposed kernels. The average AUCs over 10 runs as well as the standard deviations are reported in Table 3. The last row of each table summarizes the average rank achieved by all kernels over the benchmark datasets.
It is evident from the tables that normal form (NF) kernels perform on average better than the other Boolean kernels, with the only exception of the C-kernel on the AUC metric. This is reasonable since normal form kernels are a generalization of the (m)C-kernel and the (m)D-kernel. However, after the validation procedure, there is no guarantee that a NF kernel is always better than their correspondent base kernels, as underlined by the very good performance of the C-kernel. Another interesting observation is that both the D-kernel and the mD-kernel are almost always the worst performing ones, and this can be explained by the fact that they are less expressive than the competing kernels.
Since NF kernels have shown very good performances, we built, for each normal form, the average kernel over all the degrees, that is: In this way, given a normal form, the feature space of the resulting kernel contains all the normal form formulas for each of the degrees (c, d) with 1 ≤ c ≤ C and 1 ≤ d ≤ D. Since we have fixed the maximum degrees C and D or all the kernels, and the κ avg.NF has no other hyper-parameters, it did not require any further validation. The average AUCs over the 10 runs as well as the standard deviations are reported in Table 4. In general, we can see that the normal form kernels seem to achieve performances almost always comparable or higher than the competing kernels. Good performance is also achieved by the d-mDNF kernel which we have shown is the summation of d mC-kernels.
An interesting observation can be done regarding the monks-1 dataset which can be explained by a mDNF rule. The obtaining results on monks-1 show that most of the proposed Boolean kernels achieve an AUC of 100% while all other kernels are not able to achieve this perfect score. This is because these Boolean kernels contain the target formula, or a set of related formulas, in the feature space.
All the reported results are further confirmed by other experiments with other metrics, such as precision, recall and F1, however they are not reported here for space reasons.

Conclusions
We present a family of Boolean kernels designed to have feature spaces composed by logical formulas that can be exploited to interpret the solution of a large-margin kernel machine, such as an SVM. For all kernels, we provide the logic interpretation and an efficient way to compute them. Experimental results on many categorical datasets show that the presented kernels achieve a performance comparable to the performance of state-of-the-art kernels (such as the RBF) and other Boolean kernels. In particular, we observed that, in general, those kernels corresponding to normal form achieve very good performances. In the future, we aim to develop methods able to learn how to combine different Boolean kernels, for example by means of Multiple Kernel Learning algorithms. Moreover, we also aim to find efficient and effective ways to interpret the solution of SVMs based on Boolean kernels.