1 Introduction

The field of Inductive Logic Programming (ILP) has made steady progress over the past two decades in advancing the theory, implementation and application of logic-based relational learning. A characteristic of this form of machine learning is that data, prior knowledge and hypotheses are usually, but not always, expressed in a subset of first-order logic, namely logic programs. Side-stepping for the moment the question “why logic programs?”, it is evident that settling on some variant of first-order logic allows tools to be built that automatically construct descriptions that use relations (used here in the formal sense of a truth-value assignment to n-tuples).

There is at least one kind of task where some form of relational learning would appear to be necessary. This is to do with the identification of functions (again used formally, in the sense of a uniquely defined relation) whose domain is the set of instances in the data. An example is the construction of new “features” for data analysis based on existing relations (“\(F(m) = 1\) if a molecule m has 3 or more benzene rings fused together, otherwise \(F(m) = 0\)”). Such features are not intended to constitute a stand-alone description of a system’s structure. Instead, their purpose is to enable different kinds of data analysis to be performed better: constructing models for discrimination, estimating joint probability distributions, forecasting, clustering, and so on. If a logic-based relational learner like an ILP engine is used to construct these relational features, then each feature is formulated as a logical formula, and a measure of comprehensibility is retained in the resulting models that use these features (see Fig. 1).

The approach usually, but not always, separates relational learning (to discover features) and modeling (to build models using these features). There will of course be problems that require the joint identification of relational features and models—the emerging area of statistical relational learning (SRL), for example, deals with the conceptual and implementation issues that arise in the joint estimation of statistical parameters and relational models. It would appear that separate construction of features and statistical models would represent no more than a poor man’s SRL. Nevertheless, there is now a growing body of research that suggests that augmenting any existing features with ILP-constructed relational ones can substantially improve the predictive power of a statistical model (see, for example: Joshi 2008; Saha et al. 2012; Specia et al. 2009; Ramakrishnan et al. 2007; Specia et al. 2006). There are thus very good practical reasons to persist with this variant of statistical and logical learning for data analysis.

There are known shortcomings with the approach which can limit its applicability. First, the set of possible relational features is usually not finite. This has led to an emphasis on syntactic and semantic restrictions constraining the features to some finite set. In practice, this set is still very large, and it is intractable to identify an optimal subset of features. ILP engines for feature-construction therefore employ some form of heuristic search. Second, much needs to be done to scale ILP-based feature discovery up to meet modern “big” data requirements. This includes the ability to discover features using very large datasets not all stored in one place, and perhaps only in secondary memory; from relational data arriving in a streaming manner; and from data which do not conform to expected patterns (the concept changes, or the background knowledge becomes inappropriate). Third, even with “small” data, it is well known that obtaining the value of a feature function for a data instance can be computationally hard. This means that obtaining the feature-vector representation using ILP-discovered features can take large amounts of time. This paper is concerned only with the first of these problems, namely how to construct models when feature-spaces are very large. The data is partitioned by features (that is, vertically) and placed on different processors (or nodes).

Fig. 1

Feature discovery with relational learning. If we knew feature F1, it would be easy to construct a model for active molecules using any machine learning program (rule at the bottom). What we are talking about here is discovering the definition of F1 (box on the right), given relational descriptions of the molecules m1–m4. Once done, we may be able to construct better models for the data

We develop a simple general-purpose consensus-based modeling technique consisting of a network of computing nodes. Each node in the network:

  (a) Works with a local model that uses a small set of features;

  (b) Communicates with neighboring nodes to exchange information about its model; and

  (c) Eventually arrives at a consensus model and a (usually bigger) set of features that represents the consensus with its neighbors.

We note straightaway that, while the consensus-based approach does not provide an optimal solution to the feature-selection problem, it does provide a way of distributing the computational task of feature-construction and, for a class of models, of converging to a consensus solution.

Organization: Section 2 is a short introduction to ILP and its use in constructing features. The approach we intend to follow of distributed feature-construction followed by consensus-based modeling is introduced in Sect. 2.2.1. We view our approach as an instance of a more general technique that performs consensus-based model construction in a distributed setting. Section 3 presents a general iterative procedure for constructing models in a network of nodes capable of exchanging information about their local models and features. Experimental results are in Sect. 4. Section 5 presents related work; Sect. 6 discusses open issues and concludes the paper.

2 ILP

2.1 Specification

Since its formulation in the early 1990s in the form of a partial specification (Muggleton 1994), the field of Inductive Logic Programming (ILP) has grown to mean various forms of relational learning, with first-order logic as the recurring theme for the representation of inputs (domain-knowledge and data) and outputs (models or hypotheses). The original specifications in Muggleton (1994), though, remain a useful way to describe the function of a class of programs that construct theories either for discriminating accurately between two sets of examples (“positive” and “negative”), or for describing a set of examples, without any specific goal of discrimination. In what is now known as the “learning from entailment” formulation, an ILP algorithm is taken to be one that conforms to at least the following (we refer the reader to Nienhuys-Cheng and De Wolf (1997) for definitions in logic programming).

  • B is background knowledge consisting of a set of definite clauses \(= \{C_1, C_2, \ldots \}\)

  • \(\mathcal{L}\) is a language describing constraints on acceptable hypotheses

  • E is a finite set of examples \(= E^{+} \cup E^{-}\) where:

    • Positive Examples. \(E^{+}\) \(= \{e_1, e_2,\ldots \}\) is a set of definite clauses denoting instances entailed by some unknown target concept T in conjunction with the background knowledge;

    • Negative Examples. \(E^{-}\) \(= \{\overline{f_1}, \overline{f_2} \ldots \}\) is a (possibly empty) set of Horn clauses denoting instances consistent with \(B \wedge T\); and

    • Prior Necessity. \(B \not \models E^{+}\)

  • H \(= \{D_1, D_2,\ldots \}\), the output of the algorithm given B, \(\mathcal{L}\) and E, is a hypothesis about the unknown target T s.t. each \(D_i\) is consistent with \(\mathcal L\). A hypothesis H is acceptable if the following conditions are met:

    • Weak Sufficiency. Each \(D_i\) in H is a definite clause that has the property \(B \cup \{D_i\} \models e_1\vee e_2\vee \ldots \), where \(\{e_1, e_2, \ldots \} \subseteq E^{+}\)

    • Strong Sufficiency. \(B \cup H \models E^{+}\);

    • Weak Consistency. \(B \cup H \not \models \Box \); and

    • Strong Consistency. \(B \cup H \cup E^{-} \not \models \Box \);

Strong Consistency ensures that H is consistent with all of the negative examples. Often, implementations do not require hypotheses to meet this requirement, as some members of \(E^{-}\) are taken to be “noisy”. The specification is then refined to include a parameter whose value sets a lower bound on the accuracy required of each clause \(D_i\) in the theory. If the noise model extends to the positive examples, then in practice, implementations may also not meet the Strong Sufficiency requirement.

2.2 Implementation

Given that the specifications impose fairly minimal constraints, it is not surprising that a variety of conforming (or nearly conforming) implementations have been developed. Of these, we first describe an implementation that identifies a discriminatory model (specifically, a set of classification rules), using a randomized version of a traditional greedy set-covering approach. Individual rules in the set are identified using a general-to-specific, heuristic search guided by most-specific (“bottom”) clauses:

\(ConstructModel(B,\mathcal{L},E):\)

  1. Let H be \(\emptyset \)

  2. Left = \(E^+\)

  3. while \(Left \ne \emptyset \) do

     (a) Randomly choose an example \(e \in Left\)

     (b) Let \(\bot _\mathcal{L}(B,e)\) be the most specific rule in some language \(\mathcal L\) that logically entails e, given background knowledge B (see Muggleton 1995)

     (c) Search for the “best” rule h in the lattice ordered by the subsumption relation (Plotkin 1971) and bounded by the empty clause \(\top \) and \(\bot _\mathcal{L}(B,e)\).

     (d) \(PosCover = \{e: e \in E^+ ~{\mathrm and}~ B \cup \{h\} \models e\}\)

     (e) \(Left := Left - PosCover\)

     (f) Add h to B

     (g) Add h to H

  4. done

Here, “best” refers to the rule with the highest value for some evaluation function. This procedure is sufficiently similar to the one followed by the classic ILP algorithm Progol, which uses a mode-based language for specifying \(\mathcal{L}\), and a simple compression-based heuristic to score clauses (see Muggleton 1995 for details).
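For concreteness, the covering loop can also be written out as a short program. The sketch below is only an illustration of the procedure above, not the Progol or Aleph implementation; bottom_clause, search_best_rule and covers are hypothetical stand-ins for the ILP-specific operations in steps (b)–(d).

```python
import random

def construct_model(B, L, E_pos, bottom_clause, search_best_rule, covers):
    """Greedy randomized set-covering, in the spirit of ConstructModel (a sketch).

    bottom_clause(B, e, L)   -> most specific rule entailing e      (step b)
    search_best_rule(B, bot) -> "best" rule in the bounded lattice  (step c)
    covers(B, h, e)          -> True if B together with h entails e (step d)
    """
    H, left = [], set(E_pos)
    while left:                                            # step 3
        e = random.choice(list(left))                      # step (a)
        bot = bottom_clause(B, e, L)                       # step (b)
        h = search_best_rule(B, bot)                       # step (c)
        pos_cover = {x for x in E_pos if covers(B, h, x)}  # step (d)
        left -= pos_cover | {e}                            # step (e); removing e guarantees termination
        B = B + [h]                                        # step (f)
        H.append(h)                                        # step (g)
    return H
```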

2.2.1 Feature construction

Of more direct interest here is a derivative of ConstructModel that is used to identify Boolean features:

\(ConstructFeatures(B,\mathcal{L},E,f_{max}):\)

  1. Let F be \(\emptyset \)

  2. Left = \(E^+\)

  3. while the number of features constructed is less than some maximum \(f_{max}\) do

     (a) Select e from Left

     (b) Let \(\bot _\mathcal{L}(B,e)\) be the most specific rule in some language \(\mathcal L\) that logically entails e, given background knowledge B

     (c) Search for “good” rules in the lattice ordered by the subsumption relation and bounded by the empty clause \(\top \) and \(\bot _\mathcal{L}(B,e)\).

     (d) For each good rule h:

         i. Convert h into a Boolean feature f

         ii. \(PosCover(h) = \{e: e \in E^+ ~{\mathrm and}~ B \cup \{h\} \models e\}\)

         iii. Left = Left - PosCover(h)

         iv. \(F = F \cup \{f\}\)

  4. done

  5. return F

Now a “good” rule is taken to be one that satisfies some syntactic and semantic constraints (for example, precision and recall). The reader will recognize that Strong Sufficiency is only relevant to the ConstructModel procedure (that is, ConstructFeatures is not attempting to obtain a hypothesis that explains all the positive examples).

The conversion of an ILP rule to a Boolean feature is straightforward. Let us assume that we are constructing rules only for some class c (usually c would be the class of positive examples) and that a data instance is denoted nominally by \({\mathbf x}\) drawn from some space \(\mathcal{X}\). Then a good rule will be of the form \(h_j:\) \(Class({\mathbf x},c) \leftarrow {Cp}_j({\mathbf x}).\) Footnote 1 We adopt the terminology of Ratnaparkhi (1999): \({Cp}_j: \mathcal{X} \mapsto \{0,1\}\) denotes a “context predicate”, which corresponds to a conjunction of literals that evaluates to TRUE (1) or FALSE (0) for any element of \(\mathcal{X}\). For meaningful features we will usually require that a \({Cp}_j\) contain at least one literal; in logical terms, we therefore require the corresponding \(h_j\) to be a definite clause with at least two literals. A rule \(h_j: Class({\mathbf x},c) \leftarrow {Cp}_j({\mathbf x})\) is converted to a feature \(f_j\) using a one-to-one mapping as follows: \(f_j({\mathbf x}) = 1~{\mathrm {iff}}~{Cp}_j({\mathbf x}) = 1\) (and 0 otherwise). We will denote this mapping by Feature. Thus \(Feature(h_j) = f_j\) and \({Feature}^{-1}(f_j) = h_j\). We will also sometimes refer to \(Features(H) = \{f: h \in H ~\mathrm {and}~ f = Feature(h)\}\) and \(Rules(F) = \{h: f \in F ~\mathrm {and}~ h = {Feature}^{-1}(f)\}\).
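The Feature mapping is straightforward to express in code. In the sketch below the context predicates are ordinary Boolean functions standing in for the conjunctions \(Cp_j\) evaluated by the ILP engine, and the molecule representation is hypothetical, chosen only to echo the example of Fig. 1.

```python
from typing import Callable, Dict, List

# A rule h_j: Class(x, c) <- Cp_j(x) is represented here by its context
# predicate Cp_j, a function from a data instance to True/False.
ContextPredicate = Callable[[dict], bool]

def feature(cp: ContextPredicate) -> Callable[[dict], int]:
    """Feature(h_j): f_j(x) = 1 iff Cp_j(x) = 1, and 0 otherwise."""
    return lambda x: 1 if cp(x) else 0

def feature_vectors(instances: List[dict],
                    rules: Dict[str, ContextPredicate]) -> List[List[int]]:
    """Tabulate the Boolean feature-vector representation of every instance."""
    names = sorted(rules)
    feats = {name: feature(rules[name]) for name in names}
    return [[feats[name](x) for name in names] for x in instances]

# Hypothetical illustration in the spirit of Fig. 1.
molecules = [{"fused_benzene_rings": 3}, {"fused_benzene_rings": 1}]
rules = {"f1": lambda m: m["fused_benzene_rings"] >= 3}
print(feature_vectors(molecules, rules))   # [[1], [0]]
```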

The idea of constructing propositional representations from first-order ones goes back at least to 1990, with the LINUS system (and perhaps even earlier, to work in the mid-1980s done by R.S. Michalski and co-workers). This specific use of rules constructed by an ILP system as Boolean features appears to have been first demonstrated in Srinivasan and King (1996).

There is now a growing body of research that suggests that ILP-constructed relational features obtained in this manner can substantially improve the predictive power of statistical models (see the section on “Related Work” later in the paper). The principal difficulty is, of course, to determine how many, and which features are worth constructing for any particular kind of statistical model. In general, the number of possible relational features can be extremely large (even when confined to the world of Boolean propositional symbols, the number of rules and hence features, is exponential in the number of symbols).

To alleviate some of the difficulties just listed with feature-construction, we consider the possibility of arriving at a consensus model, using a distributed construction of small sets of features. The idea is shown diagrammatically in Fig. 2. Each ILP engine in the figure implements \(ConstructFeatures(B,\mathcal{L},E,f_{max})\) for some (small) value of \(f_{max}\). That is, each ILP engine has access to all the background knowledge and examples (we will return to this requirement later). The set of features returned by the ILP engines at a pair of nodes may or may not overlap,Footnote 2 and each node constructs a local model using the features from its ILP engine.

In this paper, we are interested in the following question: is it possible to arrive at a consensus model, starting from multiple models using different feature-sets (which may have common elements)? As we shall see, when each node constructs a local linear model, the iterations of a consensus-based algorithm (described next) result in all nodes in the network exchanging feature weights and moving through different states, but finally converging on the optimal weights for all the features.Footnote 3 On the face of it, this would appear to contradict the Fischer, Lynch and Paterson (FLP) result (Fischer et al. 1985), which asserts the impossibility of reaching consensus in an asynchronous distributed system in which even one process may fail. The consensus problem described in FLP involves an asynchronous system of processes, some of which may be unreliable: how can the reliable processes have a consistent view of the system? Our setting is similar, in that we are concerned with an asynchronous system of processes (nodes) which must have a consistent view of data stored on them (weights of the features). The process of local model construction assigns a set of weights to the features; these can get updated by the process of communication (gossip) with other nodes in the network. The work differs from the FLP setting, however, in that we assume that there are no failures in the distributed system, so the FLP result is not contradicted.

Fig. 2

An illustrative example of the consensus-based approach using Michalski’s “Trains” problem (Larson and Michalski 1977). Each node has local features generated by an ILP engine (for example, \(f_1(T) = 1\) if \(has\_car(T,C), short(C)\) holds, and 0 otherwise). In this figure, \(f_{11}\) denotes the first feature of Node 1, \(f_{21}\) the first feature of Node 2, and so on. Each node then builds local models, estimates loss and shares information with its neighbors. Eventually we would like all nodes to converge to the same model

3 An algorithm for consensus-based modeling

We use a general setting for consensus-based modeling, without any reference to ILP.

Let M denote an \(n \times m\) matrix with real-valued entries. This matrix represents a dataset of n tuples of the form \(X_i \in \mathbb {R}^m, 1 \le i \le n\). Assume, without loss of generality, that this dataset has been vertically distributed over k sites \(S_1, S_2, \ldots , S_k\): site \(S_1\) has \(m_1\) features, \(S_2\) has \(m_2\) features and so on, such that \(m_1 + m_2 + \cdots + m_k = m\), where \(m_i\) is the number of features at site \(S_i\).Footnote 4 Let \(M_1\) denote the \(n \times m_1\) matrix representing the dataset held by \(S_1\), \(M_2\) the \(n \times m_2\) matrix representing the dataset held by \(S_2\), and so on. Thus, \(M = M_1:M_2: \cdots : M_k\) denotes the concatenation of the local datasets.

We will restrict ourselves to learning a linear discriminative function over the data set M. The global function to be estimated is represented by \(J_{g} = M W_{g}^{T}\), where \(W_g\) is assumed to be a \(1 \times m\) weight vector. If only the local data is used, the function estimated at site \(S_1\) would be \(J_1 = M_1 W_{1}^{T}\), and at site \(S_2\) it would be \(J_2 = M_2 W_{2}^{T}\). The goal is to describe a decentralized algorithm for computing the weight vectors at sites \(S_1, \ldots, S_k\) such that on termination \(W_{i} \approx W_g[S_i]\) for each i, where \(W_g[S_i]\) denotes the part of the global weight vector corresponding to the attributes stored at site \(S_i\). Clearly, if all the datasets are transferred to a central location, the global weight vector can be estimated. Our objective is to learn the function in the decentralized setting, assuming that transfer of actual data tuples is expensive and may not be allowed (for example, due to privacy concerns). The weights obtained at each site on termination of the algorithm will be used for ranking the features.
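As a small illustration of this setting (using NumPy, with illustrative sizes), the vertical split and the local estimates \(J_i = M_i W_i^T\) can be set up as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m_sizes = 6, [3, 2, 4]                  # n instances; m_1, m_2, m_3 features
M = rng.normal(size=(n, sum(m_sizes)))     # full data matrix (never held at a single site)

# Vertical partition: site S_i holds only its own n x m_i block of columns.
M_parts = np.hsplit(M, np.cumsum(m_sizes)[:-1])      # [M_1, M_2, M_3]

# Each site keeps a weight vector over its local features ...
W_parts = [np.zeros(m_i) for m_i in m_sizes]

# ... and a local function estimate J_i = M_i W_i^T (one value per instance).
J_parts = [M_i @ W_i for M_i, W_i in zip(M_parts, W_parts)]

# The centralized estimate J_g = M W_g^T, with W_g the concatenation of the
# per-site blocks, is what the decentralized algorithm tries to approximate.
W_g = np.concatenate(W_parts)
J_g = M @ W_g
```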

Algorithm 1

3.1 Algorithm

Algorithm 1 makes the following assumptions:

  1.

    Model of Distributed Computation. The distributed algorithm can be seen as evolving over discrete time with respect to a “global” clock. However, the existence of this clock is of interest only for theoretical analysis. Each site has access to a local clock. Furthermore, each site has its own memory and can perform local computation (such as computing the gradient on its local features). It stores \(J_i\), which is the estimated local function. Besides its own computation, sites may receive messages from their neighbors which will help in the evaluation of the next estimate for the local function.

  2.

    Communication Protocols. Sites \(S_i\) are connected to one another via an underlying communication framework represented by a graph \(G(V, E)\), such that each site \(S_i \in \{S_1, S_2, \cdots , S_k\}\) is a vertex and an edge \(e_{ij} \in E\) connects sites \(S_i\) and \(S_j\). Communication delays on the edges in the graph are assumed to be finite. It must be noted that the communication framework is usually expected to be application dependent. In cases where no intuitive framework exists, it may be possible to simply rely on the physical connectivity of the machines, for example, if the sites \(S_i\) are part of a large cluster.

Algorithm 1 describes how the weights for features will be estimated using a consensus-based protocol. There are two main sub-parts of the algorithm: (1) exchange of local function estimates and (2) local update based on stochastic gradient descent. Each of these sub-parts is discussed in further detail below. Furthermore, assume that \(J: \mathbb{R}^m \rightarrow [0, \infty ]\) is a continuously differentiable nonnegative cost function with a Lipschitz continuous derivative.

Exchange of local function estimate: Each site locally computes the loss based on its features and then gossips with its neighbors to get information on other attributes. On receiving an update from a neighbor, the site re-evaluates \(J_i\) by forming a component-wise convex combination of its old vector and the values in the messages received from its neighbors, i.e. \(J_i^{t+1}=\alpha _{ii}(X_i W_i^T) + \alpha _{ji} (X_j W_j^T)\). Here \(\alpha _{ij}\), \(0 \le \alpha _{ij} \le 1\), is a non-negative weight that captures the fraction of information site i is willing to share with site j. The choice of \(\alpha _{ij}\) may be deterministic or randomized and may or may not depend on the time t (Kempe et al. 2003). The \(k\times k\) matrix A comprising the \(\alpha _{ij}, 1 \le i \le k, 1 \le j \le k\), is a “stochastic” matrix: it has non-negative entries and each row sums to one. More generally, it reflects the state transition probabilities between sites. Figure 3 illustrates the state transition between two sites \(S_i\) and \(S_j\).

Another interpretation of the diffusion of \(J_i\) amongst the neighbors of i draws an analogy with Markov chains: the diffusion is mathematically identical to the evolution of state occupation probabilities. Furthermore, a simple vector equation can be written for updating \(J_i^t\) to \(J_i^{t+1}\), i.e. \(J_i^{t+1} = A(i) (J_i^t)_{N_i}\), where A(i) corresponds to row i of the matrix A and \((J_i^t)_{N_i}\) is a matrix that has \(|N_i|\) rows (each row corresponding to a neighbor of site \(S_i\)) and n columns (each column corresponding to an instance). More generally, \(\mathcal {J}^{t+1} = A \mathcal {J}^{t}\), where \(\mathcal {J}^{t}\) is a \(k \times n\) matrix whose rows store the local function estimates of the n instances at each of the k sites and A is the \(k \times k\) transition probability matrix corresponding to all the sites. It follows that \(\lim _{t \rightarrow \infty } A^t\) exists, and this limit controls the rate of convergence of the algorithm.
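The diffusion step \(\mathcal {J}^{t+1} = A \mathcal {J}^{t}\) is easy to simulate. The sketch below uses an illustrative doubly stochastic A (a lazy random walk on a ring of sites), for which the network average is invariant and every row of \(\mathcal{J}\) contracts towards it:

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 4, 5                                  # k sites, n instances
I = np.eye(k)

# Lazy random walk on a ring of sites: row- and column-stochastic, irreducible.
A = 0.5 * I + 0.25 * np.roll(I, 1, axis=1) + 0.25 * np.roll(I, -1, axis=1)

J = rng.normal(size=(k, n))                  # row i holds site S_i's estimates J_i^0
avg = J.mean(axis=0)                         # the invariant network average

for t in range(50):
    J = A @ J                                # diffusion: J^{t+1} = A J^t

print(np.abs(J - avg).max())                 # ~0: every site now holds the average
```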

Fig. 3

State Transition Probability between two sites \(S_i\) and \(S_j\)

We introduce the notion of the average function estimate in the network, \(\mathbf {J}^t = \frac{1}{k}\sum _i J_i^t\), which allocates equal weight to all the local function estimates and serves as a baseline against which the individual sites \(S_i\) can compare their performance. Philosophically, this also implies that each local site should at least try to attain as much information as is required to converge to the average function estimate. Since \(\sum _i{\alpha _{ij}}=1\) (that is, the columns of A also sum to one), this estimate is invariant under the diffusion step.

The A matrix has properties which allow us to show that convergence to \(\mathbf {J}^t\) occurs. One relevant result is the Perron–Frobenius theory of irreducible non-negative matrices. We state the theorem here for continuity.

Theorem 1

(Perron–Frobenius (Varga 1962)) Let A be a positive, irreducible matrix such that the rows sum to 1. Then the following are true:

  1. The eigenvalues of A of unit magnitude are the k-th roots of unity for some k and are all simple.

  2. The eigenvalues of A of unit magnitude are the k-th roots of unity if and only if A is similar under a permutation to a k-cyclic matrix.

  3. All eigenvalues of A are bounded by 1.

Since the eigenvalues of A are bounded by 1, it can be shown that \(J_i^t\) converges to the average function estimate \(\mathbf {J}^t\) if and only if \(-1\) is not an eigenvalue (Varga 1962). Let \(\lambda _k \le \lambda _{k-1} \le \cdots \le \lambda _2 < \lambda _1 =1\) be the eigenvalues of A, and let \(\gamma (A) = \max _{i>1} |\lambda _i|\). It can be shown that \(\parallel J_i^{t+1} - \mathbf {J}^{t} \parallel ^2 \le \gamma ^2 \parallel J_i^{t} - \mathbf {J}^{t} \parallel ^2\). If \(\gamma =1\), then the system fails to converge (Varga 1962; Cybenko 1989).
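The quantity \(\gamma(A)\) is easy to compute numerically, and the geometric contraction towards the average can be checked directly; the mixing matrix below is the same illustrative one used earlier.

```python
import numpy as np

I = np.eye(4)
A = 0.5 * I + 0.25 * np.roll(I, 1, axis=1) + 0.25 * np.roll(I, -1, axis=1)

lams = np.linalg.eigvals(A)
gamma = np.sort(np.abs(lams))[-2]            # gamma(A) = max_{i>1} |lambda_i|
print(gamma)                                 # 0.5 here; gamma = 1 would mean no convergence

# Check the contraction of the distance to the network average.
rng = np.random.default_rng(2)
J = rng.normal(size=(4, 5))
avg = J.mean(axis=0)
before = np.linalg.norm(J - avg)
after = np.linalg.norm(A @ J - avg)
print(after <= gamma * before + 1e-12)       # True
```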

Local stochastic gradient update is done as follows: \(W_i^{t+1} = W_i^t - \eta _i^t s_i^t\), where \(s_i^t=\frac{\partial J_i^t}{\partial W_i^t} (X_r, W_i^t)\), with \(X_r \in \mathbb {R}^{m_i}\) a randomly drawn instance, is the estimated gradient, \(W_i^t\) is the weight vector and \(\eta _i^t\) is the learning rate at node i at time t.
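Putting the two sub-parts together, one round at a site might look like the sketch below. This is a minimal sketch of one plausible reading of the updates above, not the exact procedure of Algorithm 1: a hinge loss is assumed (as in the experiments of Sect. 4), and only the \(\alpha_{ii} X_i W_i^T\) term of the mixed estimate is treated as depending on the local weights.

```python
import numpy as np

def dfe_round(X_i, y, W_i, J_i, received, eta, rng):
    """One round at site S_i: gossip step, then a local stochastic gradient step.

    X_i      : n x m_i matrix of the features held locally at S_i
    y        : length-n vector of class labels in {-1, +1}
    W_i, J_i : current local weights (length m_i) and function estimates (length n)
    received : list of (alpha_ji, J_j) pairs from neighbors; alpha_ii is the
               remaining weight, so that the mixing weights sum to one
    eta      : learning rate eta_i^t
    """
    # (1) Exchange: component-wise convex combination of the old estimate
    #     and the estimates received from neighbors.
    alpha_ii = 1.0 - sum(a for a, _ in received)
    J_i = alpha_ii * J_i + sum(a * J_j for a, J_j in received)

    # (2) Local update: hinge-loss subgradient on one randomly drawn instance X_r.
    r = rng.integers(len(y))
    if y[r] * J_i[r] < 1:                         # hinge loss is active
        W_i = W_i - eta * (-y[r] * alpha_ii * X_i[r])

    return W_i, X_i @ W_i                         # new weights and refreshed local estimate
```

A full run would repeat this round at every site, gossiping the refreshed estimates between rounds and decaying the learning rate over time.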

It is evident that there are no restrictions on the features used by the DFE algorithm. The proofs of correctness and termination are in “Appendix”, and we will henceforth refer to the procedure as the DFE algorithm. We now investigate empirically the performance of the algorithm when the nodes in the network use features constructed locally by an ILP engine.

4 Empirical evaluation

4.1 Aims

Our objective is to investigate empirically the utility of the consensus-based algorithm we have described. We use \(Model(k,f)\) to denote the model returned by the consensus-based algorithm in Sect. 3 using k nodes in a network, each of which can call on an ILP engine to construct at most f features. In this section, we compare the performance of \(Model(N,F)\) \((N > 1)\) with \(Model(1,N \times F)\). The latter effectively represents the model constructed in a non-distributed manner, with all features present at a single centralized node. For simplicity, we will call the former the Distributed model and the latter the Centralized model.

We intend to examine if there is empirical support for the conjecture that the performance of the Distributed model is better than that of the Centralized model. We are assuming that the performance of a model construction method is given by the pair (A, T) where A is an unbiased estimate of the predictive accuracy of the classifier, and T is an unbiased estimate of the time taken to construct a model. In all cases, the time taken to construct a model also includes the time taken to identify the set of features by the ILP engine and the time to compute their values. When \(k > 1\), the time will also include time for exchanging information. Comparisons of pairs \((A_1,T_1)\) and \((A_2,T_2)\) will simply be lexicographic comparisons.
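Assuming that higher accuracy is preferred and that ties are broken by lower time, the lexicographic comparison amounts to the following:

```python
def better(p1, p2):
    """Lexicographic comparison of performance pairs (A, T):
    prefer higher accuracy; break ties by lower model-construction time."""
    (A1, T1), (A2, T2) = p1, p2
    return (A1, -T1) > (A2, -T2)

print(better((0.91, 120.0), (0.91, 300.0)))   # True: same accuracy, less time
```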

4.2 Materials

4.2.1 Data

Data for experiments are in two categories:

  1.

    Synthetic We use the “Trains” problem posed by R. Michalski for controlled experiments. Datasets of 1000 examples are obtained for randomly drawn target concepts (see “Methods” below).Footnote 5 For this we use S.H. Muggleton’s random train generatorFootnote 6 that defines a random process for generating examples. We will use this data for controlled experiments to test the principal conjecture about the comparative performances of Distributed and Centralized models.

  2.

    Real We report results from experiments conducted using 7 well-studied real-world problems from the ILP literature. These are: Mutagenesis (King et al. 1996); Carcinogenesis (King and Srinivasan 1996); DssTox (Muggleton et al. 2008); and 4 datasets arising from the comparison of Alzheimer’s drugs, denoted here as Amine, Choline, Scop and Toxic (Srinivasan et al. 1996). The dataset characteristics are reported in Table 1. Our purpose in examining performance on the real data is twofold. First, we intend to see if the use of linear models is too restrictive for real problems. Second, we would like to see if the results obtained on synthetic data are reflected on real-world problems. We note that for these problems predictive accuracy is the primary concern.

Table 1 Dataset sizes and class distributions

Language constraints We have mainly relied on the use of mode declarations to incorporate language restrictions (see Muggleton 1995 for a description of modes and their usage by a class of ILP systems). For the synthetic dataset of trains, mode declarations are provided for the predicates describing the structure of a train and the properties of its cars, using the syntax introduced in Muggleton (1995).


We follow the Aleph manual (Srinivasan 1999) for the meaning of restrictions. All declarations are of the form mode(RecallNumber,PredicateMode).Footnote 7 Here RecallNumber bounds the non-determinacy of a form of predicate call, and PredicateMode specifies a legal form for calling a predicate. RecallNumber can be either (a) a number specifying the number of successful calls to the predicate; or (b) \(*\), specifying that the predicate has bounded non-determinacy. PredicateMode is a template of the form p(ModeType, ModeType,...) where ModeType is either (a) simple; or (b) structured. A simple ModeType is one of: +T, which means that when a literal with predicate symbol p appears in a hypothesized clause, the corresponding argument should be an “input” variable of type T; -T, which means that the corresponding argument is an “output” variable of type T; or #T, which means that the corresponding argument should be a constant of type T. A structured ModeType is of the form f(...), where f is a function symbol, each argument of which is either a simple or structured ModeType.

Mode declarations of a similar kind are used for the Mut188, Canc330 and DssTox datasets.


For reasons of space, we do not show the mode declarations for the other datasets.

4.2.2 Algorithms and machines

The DFE algorithm has been implemented on a Peer-to-Peer simulator, PeerSim (Montresor et al. 2009). This software sets up the network by initializing the nodes and the protocols to be used by them. The newscast protocol, an epidemic content-distribution and topology-management protocol, is used. Nodes can perform actions on local data as well as communicate with each other by selecting a neighbor to communicate with (using an underlying overlay network). In each communication step, they mutually update their approximations of the value to be calculated, based on their previous approximations. The emergent topology from a newscast protocol has a very low diameter and is very close to a random graph (Jelasity et al. 2004, 2005).

The ILP system used in all experiments is Aleph (Srinivasan 1999). The latest version of this program (Aleph 6) is available from the second author. We use Aleph to construct features [specifically, the induce_features command in that program: the precise description of how this is done is in Srinivasan (1999)]. To a good approximation, the procedure is as described in Sect. 2.2.1 (some small difference arises from the Aleph implementation using a class-based upper-bound on the number of features). The Prolog compiler used is YapFootnote 8 (version 6.2.0). The programs are executed on a machine with dual Quad-Core AMD Opteron 2384 processors (2.7 GHz), 32 GB RAM, and local storage of \(4 \times 146\) GB 15K RPM Serial Attached SCSI (SAS) hard disks.

4.3 Method

For the synthetic data, classification tasks for the DFE algorithm are randomly constructed from disjunctive concepts. “Simple” concepts have 1–4 disjuncts, and “complex” ones between 8 and 12 disjuncts.Footnote 9 For any one classification task, a data instance x is defined as “positive” if some underlying concept is true for x. If the underlying concept is a simple concept, then the classification task is said to have a simple target; otherwise it has a complex target. Classification tasks are constructed randomly as follows:

  1. A concept (“Target”) is generated by:

     (a) Randomly obtaining the number of disjuncts k;

     (b) Drawing k features from a population of features; and

     (c) Defining the concept as the disjunction of the k features.

  2. A binary classification task is then defined using the concept constructed, and positive and negative instances are generated randomly as training data for this classification task.

Figure 4 shows an example of the relationship between features, concepts and targets. We will refer to Steps (a)–(c) as “randomly drawing a target concept”. Of course, when given the training data, the DFE algorithm does not know the features used in the target concept. Instead, nodes in the network use the data, background knowledge, and their ILP engines to construct local features, and a consensus model is obtained to discriminate between positive and negative examples. We note that it may not be sufficient simply to identify one of the k disjuncts in a Target: with sufficient data, for each disjunct there will be positive instances for which that disjunct is FALSE (they are covered by other disjuncts). Thus, a node in the distributed setting that correctly identifies one of the features can still have a poor accuracy on the training data.
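A sketch of this task-generation scheme is shown below, with features represented abstractly as Boolean functions drawn from a pool; the propositional stand-ins are hypothetical, since the actual features are first-order rules constructed by the ILP engine over train descriptions.

```python
import random

def random_target(feature_pool, kind, rng):
    """Steps 1(a)-(c): a target is a disjunction of k randomly drawn features."""
    k = rng.randint(1, 4) if kind == "simple" else rng.randint(8, 12)
    disjuncts = rng.sample(feature_pool, k)
    return lambda x: any(f(x) for f in disjuncts)

def label(instances, target):
    """Step 2: an instance is positive iff the target concept is true for it."""
    return [(x, "+" if target(x) else "-") for x in instances]

# Hypothetical illustration: feature i simply tests bit i of an instance.
rng = random.Random(0)
pool = [lambda x, i=i: bool(x[i]) for i in range(20)]
target = random_target(pool, "complex", rng)
data = [tuple(rng.randint(0, 1) for _ in range(20)) for _ in range(5)]
print(label(data, target))
```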

Fig. 4

Classification problems using simple and complex concepts. The features \(F_1\), \(F_2\), ... are functions whose values are TRUE, depending on the conditions in their definitions. Simple concepts have 1–4 disjuncts, and complex ones between 8 and 12 disjuncts. Binary classification tasks for the DFE algorithm are based on simple or complex concepts: a data instance x is defined as “positive” if some underlying concept is true for x (shown here for \(Simple_1\)). If the underlying concept is a simple concept, then the classification task is said to have a simple target; otherwise it has a complex target. For each classification task, positive and negative instances are generated randomly and provided as training data to the DFE algorithm

Our method for experiments is straightforward:

  1. For each kind of concept (“simple” or “complex”):

     (a) Randomly draw a target concept

     (b) Classify each data instance as \(+\) or − using the target concept

     (c) Randomly generate a network with N nodes

     (d) For each node in the network:

         i. Set the number of iterations T and initialize the learning parameter \(\eta _i\) for the node. It is assumed that all nodes agree on the initial choice of T and \(\eta _i = \eta \).

         ii. Execute the algorithm described in Sect. 3 for T iterations, with the ILP engine restricted to constructing F features

         iii. Record the predictive accuracy A of the (local) model along with the time T taken to construct the model (this includes the feature construction time, and the feature computation time). The pair (A, T) is the performance of the Distributed model for the concept.

     (e) Using a network with a single node:

         i. Execute the algorithm described in Sect. 3 for T iterations, learning parameter \(\eta \), and the ILP engine restricted to constructing \(N \times F\) features

         ii. Record the predictive accuracy \(A'\) of the model along with the time taken to construct the model \(T'\) (again, this includes the feature construction time and feature computation time). The pair \((A',T')\) is the performance of the Centralized model for the concept.

  2. Compare the performances of the Distributed and the Centralized models for the concepts.

The following additional details are relevant:

  1.

    Two sources of sampling variation result with this method. First, variations are possible with the target drawn in Step 1a. Second, to ensure that both the Distributed and Centralized approaches are constructing features from the same feature-space, we employ the facility within Aleph of drawing features from an explicitly defined feature space (this is specified using a large tabulation of features allowed by the language constraints). In effect, we are performing a randomized search for good features within a pre-defined feature space. Although only “good” features are retained (see below), even after controlling for feature-spaces, sampling variations can nevertheless result for both the Distributed and Centralized models from the step of drawing features. We report averages for 5 repetitions of draws for the target, and 5 repetitions of the randomized search for a given target.

  2.

    A target is generated as follows. For simple targets, the number of features is chosen randomly from the range 1–4. For complex concepts, the number of features is randomly chosen from the range 8–12. Features are then randomly constructed using the ILP engine, and their disjunction constitutes the target concept.

  3.

    As noted previously, data instances for controlled experiments are drawn from the “Trains” problem. The data generator uses S.H. Muggleton’s random train generator. This implements a random process in which each data instance generated contains the complete description of a data object (nominally, a “train”).

  4.

    An initial set of parameters needs to be set for the ILP engine to describe “good” features. These include C, the maximum number of literals in any acceptable clause constructed by the ILP system; Nodes, the maximum number of nodes explored in any single search conducted by the ILP system; Minacc, the minimum accuracy required of any acceptable clause; and Minpos, the minimum number of positive examples to be entailed by any acceptable clause. C and Nodes are directly concerned with the search space explored by the ILP system. Minacc and Minpos are concerned with the quality of results returned (they are equivalent to “precision” and “support” used in the data mining literature). We set \(C=4\), Nodes=5000, Minacc=0.75 and Minpos=2 for our experiments here. There is no principled reason for these choices, other than that they have been shown to work well in the literature (Srinivasan and Ramakrishnan 2011).

  5.

    The parameters for the PeerSim simulator include the size of the network, degree distribution of the nodes and the protocol to be executed at each node. We report here on experiments with a distributed network with \(N=10\) nodes. Each of these nodes can construct up to \(F = 500\) features (per class) and the centralized approach can construct up to \(N \times F = 5000\) features (per class).

  6.

    The experiments here use the Hinge loss function. The results reported are for values of T at which the stochastic gradient descent method starts to diverge.

  7.

    The learning rate \(\eta _i\) remains a difficult parameter in any SGD-based method. There is no clear picture on how this should be set. We have adopted the following domain-driven approach. In general, lower values of the learning rate imply a longer search. We use three different learning rates corresponding to domains requiring high, moderate and low amounts of search (corresponding to complex, moderate or simple target concepts). The corresponding learning rates are 0.01, 0.1 and 1. We reiterate that there is no prescribed method for deciding these values, and better results may be possible with other values. The maximum number of iterations T is set to a high value (1000). The algorithm may terminate earlier, if there are no significant changes to its weight vector.

  8.

    Since the tasks considered here are binary classification tasks, the performance of the ILP system in all experiments will be taken to be the classification accuracy of the model produced by the system. By this we mean the usual measure computed from a \(2 \times 2\) cross-tabulation of actual and predicted classes of instances. We would like the final performance measure to be as unbiased as possible by the experimental estimates obtained during optimization, and estimates are reported on a holdout set.

  9.

    With results from multiple repetitions (as we have here), it is possible to perform a Wilcoxon signed-rank test for both differences in accuracy and differences in time. This allows a quantitative assessment of difference in performance between the Distributed and the Centralized models. However, results with 5 repetitions are unreliable, and we prefer to report on a qualitative assessment, in terms of the average of accuracy and time taken.

A data instance in each of the real datasets is a molecule, and contains the complete description of the molecule. This includes: (a) bulk properties, like molecular weight, logP values and so on; and (b) the atomic structure of the molecule, along with the bonds between the atoms. For these datasets, clearly there are no concepts to be drawn, and sampling variation results solely from the feature-construction process. We therefore only report on experimental results obtained from repeating the randomized search for features. Again, estimates of predictive accuracy are obtained from a holdout set. For Mutagenesis and Carcinogenesis, each of the 10 computational nodes in the distributed network constructs up to 500 features, and the centralized approach constructs up to 5000 features (per class). For DssTox, we found there were fewer high-precision features than in the other two datasets, so the nodes in the distributed network construct up to 50 features and the centralized node up to 500 features (per class).

Table 2 Results on synthetic data comparing Centralized and Distributed models

4.4 Results

We present first the main results from the experiments on synthetic data (shown in Table 2). The primary observations in these experiments are as follows: (1) On average, as concepts vary, the distributed algorithm appears to achieve higher accuracies than the centralized approach, although the differences may not be significant for a randomly chosen concept; (2) On average, as concepts vary, the time taken for model construction by the distributed approach can be substantially lowerFootnote 10; and (3) The variation in both accuracy and time with the distributed approach, whether due to changes in the concept or to repetitions of feature-construction, appears to be less than with the centralized approach.

Taken together, these results suggest that good, stable models can be obtained from the distributed approach fairly quickly, and that the approach might present an efficient alternative to a centralized approach in which all features are constructed by a single computational unit.

At this point a question could be raised on the value of the synthetic data. There are at least 3 issues here:

  • First, why bother with synthetic data at all? The answer to this is that it gives us the opportunity to perform controlled experiments, of the kind that would be impossible with real-world datasets.

  • Secondly, why use the “trains” problems? The trains problems are a well-known benchmark in the ILP literature, and there is an easily available simulator that is capable of generating new data instances by random draws from a known distribution (each instance is a random train, with the number of carriages in the train following a multinomial distribution). There is precedent in the ILP literature for using this to test algorithms: for example, Cardoso & Zaverucha used a synthetic trains dataset of 1.25 million examples to evaluate their methods, all of which achieved 100% accuracy on the synthetic dataset.

  • Thirdly, why are we getting theories with high accuracies? The targets are randomly generated k-disjuncts with different values of k (1–4 for “simple” targets and 8–12 for “complex” targets). Further, since each feature is a rule restricted to Datalog without recursion, with a bound on the maximum number of literals, each of which is a predicate of fixed maximum arity, the hypothesis space is clearly bounded and therefore learnable (in the PAC sense) with arbitrarily high accuracy and confidence, given sufficient examples. It is also known theoretically that Winnow can identify a linear threshold function for k-disjuncts making a small number of mistakes. So, with sufficiently large amounts of data (recall we use 1000 training instances here), it is not surprising that high accuracies are obtainable. What is surprising is that the accuracy on simple concepts is lower than on complex concepts. This suggests that there must be more simple concepts consistent with the target on the training data than complex ones.

Table 3 Results on real data comparing Centralized and Distributed models

What can we expect from the consensus-based learner on the real datasets? Results are in Table 3, and we observe the following: (1) There is a significant difference in accuracies between the distributed and centralized models on two of the datasets (Canc330 and DssTox). On balance, we cannot conclude from this that either one of the models is better; (2) As with the synthetic data, the time for the distributed models is substantially lower. As before, the time is dominated by the feature-construction effort. For the real datasets, it appears that it is substantially easier to get smaller subsets of good features than larger ones (as observed from the differences in the times between the distributed and centralized models); and (3) Comparisons against the baseline suggest that the use of linear models is not overly restrictive, since the models obtained are not substantially worse (predictively speaking) than those reported in the ILP literature. The Distributed approach does better than Baseline on 5 of 7 datasets: this presents some evidence supporting the case for the former, although the numbers are not large enough to claim statistical significance. Quantitatively, the differences in predictive accuracies between Distributed and Centralized are not statistically significant, although there is evidence that the differences in time are statistically significant (Distributed is faster).

Again, taken together, these results provide support to the trends observed with the controlled experiments and suggest that the distributed approach would continue to perform at least as well as the centralized approach on real data.

4.5 Supplementary results

We turn now to some issues that have been brought out by the experimental results. We present immediate practical issues first, and then examine some more abstract questions.

4.5.1 The learning rate \(\lambda \)

As with all methods based on stochastic gradient descent, the central parameter remains the learning rate \(\lambda \). Many strategies have been suggested in the literature to automatically adjust the learning rate (see, for example, Bottou 2010; Bottou and Bousquet 2011; Darken and Moody 1990; Sutton 1992). In general, the learning rate on an iteration \(\eta _i\) of the algorithm here is of the form \(\eta _i = \frac{B}{T^{-\alpha }}\); \(B=e^{-\lambda T}; 0 < \alpha \le 1\). Table 4 shows the effect of varying \(\lambda \) for the synthetic datasets used here. These results show that the determination of \(\lambda \) is a tricky business that can depend on the nature of the target theory being approximated (more precisely, it depends on the amount of search needed). For the experiments reported above, we have used fixed values of \(\lambda \) based on our assessment of the search required (see the additional details in the “Methods” section). That \(\lambda \) is dataset-dependent and needs to be assigned in some domain-dependent manner appears to be an unavoidable aspect of any SGD-based method.

Table 4 Effect of the learning rate \(\lambda \)

4.5.2 Disaggregated time

The time estimates reported in the tabulations consist of four separate components: (a) feature-construction; (b) feature-evaluation; (c) model-construction; and (d) inter-node communication. Components (a)–(c) are part of both the centralized and distributed learners, and it is of some interest to examine their separate contributions. Feature-evaluation time is very small for both simple and complex trains (\(<0.01\%\)). Feature-construction accounts for roughly \(99\%\) of the time spent in feature engineering.

We note that all experiments reported here use the PeerSim simulator, which does have provisions for modeling the transport layer via a special protocol that provides a message-sending service. There are also options for modeling latency among geographically distributed nodes, in addition to churn models for nodes. None of these aspects have been explored here, and the tabulations reported here use default settings of the simulator, which are designed to model the network as a random graph. It is nevertheless evident that distributed model construction is only worthwhile provided communication costs are low; there will be real networks (and corresponding simulator settings) for which distributed model-construction can be a good or a bad approach.

Figure 5 depicts the variation in the feature- and model-construction times across repetitions of randomly drawn concepts. For model construction, the number of iterations T is varied between 1, 5 and 10. In all cases, a single iteration of the stochastic gradient descent algorithm is good enough to obtain reasonably good accuracy in the distributed setting. Also, for small values of T, the feature-construction time dominates communication costs.

Fig. 5

Feature and model construction times for randomly drawn concepts. The plots are averages over 5 repetitions. a Simple, b Complex

4.5.3 Consensus by union of features

A natural question that emerges at this point is this: “How does the algorithm proposed here compare against one that achieves a trivial consensus simply by a union of all the features found by nodes in the network?” Such a consensus would be arrived at as follows. Let us assume, without loss of generality, that each node gets its own sample of features and sends it to one centralized site that is responsible for the “Union” operation. Both the test and training data will, of course, be concatenated vertically, but there are several ways to arrive at the union of features:

  (a) Features are generated and sent sequentially, with some predefined notion of ordering amongst nodes (node 1 goes first, then node 2 and so on). This helps to provide an empirical upper bound for the time required to compute the union of features. We assume that the upper bound is the sum of feature-construction times at each distributed node, and hence refer to this as Union-ub;

  (b) Features are generated in parallel, synchronously. That is, all nodes generate a set of features, starting at the same time. We refer to this as Union-lb, corresponding to the lower bound; and

  (c) Features are generated in parallel, but asynchronously. That is, nodes generate sets of features, not necessarily starting at the same time. In this case, there is no straightforward way to compute a bound on the time to compute features, since nodes can elect to commence feature-construction at any point in time.

The results presented in Table 5 provide the accuracy and time for building the model under the union operation. In general, the model(s) built on the “Union” of features do appear to have performance comparable to the distributed algorithm presented here, but this may come at a price if a sequential operation is used to obtain the union of features. Figure 6 further analyzes the time by dividing it into feature- and model-construction time(s).

Table 5 Results on synthetic data for a consensus-classifier using the union of feature-sets identified at each node in a distributed network
Fig. 6

Feature and model construction time for randomly drawn simple and complex targets. Lower- and upper-bounds for a consensus-classifier that uses the union of features constructed by nodes in a distributed network (Union-lb and Union-ub respectively) are compared against the centralized and distributed models. a Simple, b complex

4.5.4 Network topology

Experiments were designed to test the effect of the network topology on the convergence rate of the algorithm, using synthetic data. Taking K to be the number of outgoing edges from a node, three different strategies of node addition are explored: (a) Random: K nodes are added at random; (b) Star: K nodes are added in a star topology; and (c) Scale-free network: a scale-free network is grown using the Barabasi–Albert (BA) model, ensuring that nodes have power-law degree distributions. Two values of K (2 and 8) were used in the empirical analysis. Our results indicate that there were no statistically significant differences in performance for the 10-node network(s) studied in this paper.

4.5.5 Convergence accuracy

In experiments here (both with synthetic and real data), we have observed that the predictive accuracy of the model from the distributed setting is comparable to the predictive accuracy from the non-distributed setting. Unlike gains in time which can be expected from a distributed setting, it is not evident beforehand what can be expected on the accuracy front. This is because the models constructed in the two settings can, and usually do, sample different sets of features. The results here suggest a conjecture that the consensus-based approach will always converge to a model that is within some small error bound of the model from a centralized approach with the same number of features. We have some reason to believe that this conjecture may hold in some circumstances, based on the use of Sanov’s theorem (Sanov 1957) and related techniques.

5 Related work

We review related work from several areas: large scale feature selection, decentralized optimization and consensus-based learning, and discovery of feature subsets by ILP engines.

Large scale feature selection Techniques for selecting from a large (but finite) set of features of known size dFootnote 11 have been well studied within the machine learning community, usually under the umbrella terms of filter-based or wrapper-based methods (see, for example, John et al. 1994; Liu and Motoda 1998). While most of the early work was intended for implementation on a single machine, several algorithms have been proposed to enable feature-selection from massive datasets (Garcia et al. 2006; Lopez et al. 2006; Sun 2014). Singh et al. (2009) propose a framework for handling feature selection for logistic regression by developing a new forward feature selection heuristic that ranks features by their estimated effect on the resulting model’s performance. Zhao et al. (2012) describe an algorithm that selects features based on their ability to explain data variance. Zhou et al. (2014) present a framework for Parallelizable Feature Selection (PFS) which is inspired by the theory of group testing. Group testing is a combinatorial search paradigm in which the goal is to identify a small subset of relevant items from a large pool of possible items. The feature selection problem in the group testing framework applies a “test” to a set of features, which produces a score designed to measure the quality of the features. From the collection of test scores, the relevant features are then identified. PFS has several similarities to the algorithm proposed in this paper: notably, the set of features at each node can be viewed as a collection of relevant features for group testing, and the process of local function evaluation can be mapped to the “test” required in group testing. The end product, however, is fundamentally different in the two algorithms: in PFS, all feature sets are identified in advance without knowing the scores of other tests and a final subset of features is discovered; nodes executing the DFE algorithm have updated scores of the local features after gossiping with neighbors. The updated scores help learn approximations of the global objective and nodes independently reach a consensus.

Furthermore, distributed computation is utilized to solve optimization problems that arise when exploring huge, nonlinear and multidimensional search spaces. Distributed feature selection algorithms often solve decentralized optimization problems in the constrained and unconstrained settings (see the seminal work of Tsitsiklis et al. 1986; Tsitsiklis 1984; Bertsekas and Tsitsiklis 1997). The convergence properties of these decentralized optimization problems naturally affect the performance of the distributed feature selection algorithms. In recent work, it has been shown that the convergence properties of distributed unconstrained optimization algorithms (such as gradient descent and its stochastic variants) can be related to the network topology of the underlying distributed infrastructure by using its spectral properties (Boyd et al. 2006; Shah 2009; Dimakis et al. 2006; Benezit et al. 2010).

Distributed optimization Learning feature subsets in distributed environments using decentralized optimization has become an active area of research (Duchi et al. 2012; Agarwal et al. 2014; Christoudias et al. 2008) in recent years. Agarwal et al. (2014) present a system and a set of techniques for learning linear predictors with convex losses on terabyte-sized datasets. Their goal is to learn problems of the form \(\min _{w \in \mathbb {R}^d} \sum _{i=1}^{n} l(w^T x_i; y_i) + \lambda R(w)\), where \(x_i\) is the feature vector of the \(i^{th}\) example, w is the weight vector and R is a regularizer. The data are split horizontally and examples are partitioned on different nodes of a cluster. Duchi et al. (2012) present a dual averaging sub-gradient method which maintains and forms weighted averages of sub-gradients in the network. An interesting contribution of this work is the association of convergence of the algorithm with the underlying spectral properties of the network. Similar techniques for learning linear predictors have been presented elsewhere (Mangasarian 1995; Ryan et al. 2010; Zinkevich et al. 2010; Niu et al. 2011; Boyd et al. 2011). The algorithm presented in this paper differs from this body of literature in that the data are split vertically amongst nodes in the cluster, thereby necessitating a different algorithm design strategy. In addition, this is a batch algorithm and hence quite different from distributed online learning counterparts (Dekel et al. 2012; Langford et al. 2009; Bottou and Bousquet 2011). Das et al. (2010) show that three popular feature selection criteria (misclassification gain, Gini index and entropy) can be learnt in a large peer-to-peer network. This is then combined with protocols for asynchronous distributed averaging and secure sum to present a privacy-preserving asynchronous feature selection algorithm.

Discovery of feature subsets by ILP engines Existing literature on discovering a subset of interesting features from large, complex search spaces, such as those explored by ILP engines, adopts one of the following strategies:

  1. Optimally (Han and Wang 2009; Nowozin et al. 2007; Kudo et al. 2004) or heuristically (Joshi 2008; Saha et al. 2012; Specia et al. 2009; Ramakrishnan et al. 2007; Specia et al. 2006; Nagesh et al. 2012; Chalamalla et al. 2008) solve a discrete optimization problem;

  2. Optimally (Jawanpuria et al. 2011; Nair et al. 2012) solve a convex optimization problem with sparsity-inducing regularizers;

  3. Compute all relational features that satisfy some quality criterion by systematically and efficiently exploring a prescribed search space (Pei 2004; Ji et al. 2006; Aseervatham et al. 2006; Antunes and Oliveira 2003; Pei et al. 2005; Agrawal and Srikant 1995; Pei et al. 2004; Ayres et al. 2002; Garofalakis et al. 1999; Davis et al. 2005a, b; Landwehr et al. 2007; Davis et al. 2007; Džeroski 1993; Lavrac and Dzeroski 1993).

Again, much of this work has been of a non-distributed nature, and usually assumes a bound on the size of the feature-space. The latter is not the case for a technique like the one proposed in Joshi (2008), which describes a randomized local search that repeatedly constructs a subset of features and then performs a greedy local search starting from this subset. Since enumerating all local moves can be prohibitively expensive, the selection of moves is guided by the errors made by the model constructed from the current set of features. Nothing is assumed about the size of the feature-space, which makes the setting comparable to the vertical partitioning of the feature-space that we are interested in. Multiple random searches can clearly be conducted in parallel, although this is not done in that paper (Zelezny et al. 2006; Fonseca et al. 2005; Dehaspe and De Raedt 1995). As with most randomized techniques of this kind, not much can be said about the quality of the final model.
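The sketch below gives the flavour of such a randomized, error-guided local search. It is our own illustrative rendering, not the exact procedure of Joshi (2008): the callbacks sample_feature, build_model, score and misclassified are hypothetical placeholders for an ILP feature constructor, a model builder, a model-quality measure and the set of currently misclassified examples.

    import random

    def randomized_local_search(sample_feature, build_model, score, misclassified,
                                restarts=10, subset_size=20, steps=50):
        """Illustrative sketch only. sample_feature(examples) constructs a relational
        feature, optionally biased towards the given examples (None = unbiased);
        build_model(features) fits a model; score(model) measures its quality;
        misclassified(model) returns the examples the model currently gets wrong."""
        best_subset, best_score = None, float("-inf")
        for _ in range(restarts):
            subset = [sample_feature(None) for _ in range(subset_size)]
            model = build_model(subset)
            current = score(model)
            for _ in range(steps):
                # Local move: swap a random feature for one constructed from the
                # errors of the current model.
                trial = list(subset)
                trial[random.randrange(len(trial))] = sample_feature(misclassified(model))
                trial_model = build_model(trial)
                if score(trial_model) > current:
                    subset, model, current = trial, trial_model, score(trial_model)
            if current > best_score:
                best_subset, best_score = subset, current
        return best_subset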

Perhaps of most interest to the work here are the Sparse Network of Winnows (SNoW) classifiers described in Roth (1998) and Carlson et al. (1999). As it stands, this approach horizontally partitions the data into subsets, constructs multiple linear models using Winnow's multiplicative update rule, and finally uses a majority vote to arrive at a consensus classification. On the surface, this appears quite different from what we propose here. Nevertheless, there are reasons to believe that the approach can be usefully extended to our setting. It has been shown elsewhere that the Winnow-based approach extends to an infinite-attribute setting (Blum 1992). The work in this paper shows that consensus linear models are possible when convex cost functions are used. Finally, from the ILP viewpoint, Srinivasan and Bain (2014) show how Winnow-based models can be constructed in an infinite-attribute setting using an ILP engine with a stream-based model of the data. Taken together, these results suggest that a combination of the techniques we propose and those in Carlson et al. (1999) could be used to develop linear models that handle both horizontal partitioning of the data and vertical partitioning of the feature-space.
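The sketch below (illustrative only: the update rule is the standard Winnow multiplicative update, but the partitioning and voting scaffolding is ours rather than the exact SNoW implementation of Roth 1998 or Carlson et al. 1999) shows the two ingredients being combined: a Winnow learner trained on each horizontal partition of the data, with a majority vote over the resulting linear models at prediction time.

    import numpy as np

    def train_winnow(X, y, threshold=None, alpha=2.0):
        """Standard Winnow: multiplicative updates on mistakes.
        X is a 0/1 matrix of feature values; y takes values in {0, 1}."""
        n, d = X.shape
        theta = threshold if threshold is not None else d / 2.0
        w = np.ones(d)
        for x, label in zip(X, y):
            pred = 1 if w @ x >= theta else 0
            if pred == 0 and label == 1:       # promotion step
                w[x == 1] *= alpha
            elif pred == 1 and label == 0:     # demotion step
                w[x == 1] /= alpha
        return w, theta

    def train_on_partitions(partitions):
        """partitions: list of (X, y) horizontal splits of the data."""
        return [train_winnow(X, y) for X, y in partitions]

    def majority_vote(models, x):
        """Consensus classification across the per-partition linear models."""
        votes = sum(1 if w @ x >= theta else 0 for w, theta in models)
        return 1 if votes > len(models) / 2 else 0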

Learning k-disjuncts Although the technique we have described here is not specifically aimed at learning k-disjuncts, the synthetic Simple and Complex datasets are both of this nature. It is instructive, therefore, to note some theoretical results from the literature on identifying such concepts. First, note that a k-term DNF formula is a disjunction of k terms, where each term is a conjunction of literals. The k-term DNF learning problem can be described as follows: given (a) a set Var of Boolean variables; (b) a set Pos of truth-value assignments \(p_i: Var \rightarrow \{0,1\}\); (c) a set Neg of truth-value assignments \(n_i: Var \rightarrow \{0,1\}\); and (d) a natural number k, the goal is to find a k-term DNF formula that is consistent with Pos and Neg. In general this problem is NP-hard, and exhaustive algorithms are computationally impractical. Several stochastic local search algorithms (Rückert and Kramer 2003; Rückert et al. 2002) have been proposed for DNF learning by reducing the k-term DNF problem to the well-known SAT problem. Among the more popular algorithms are Winnow-like classifiers, which are known to make O(k log n) mistakes before converging on a monotone disjunction of k out of n variables (Littlestone 1988).
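For concreteness, here is a small illustrative instance of our own (not drawn from the cited papers). Take \(k = 2\) and variables \(x_1, \ldots, x_4\), with \(Pos = \{1100, 0011\}\) and \(Neg = \{1010, 0101\}\), writing each assignment as the values of \(x_1 x_2 x_3 x_4\). The 2-term DNF formula

\[ \varphi \;=\; (x_1 \wedge x_2) \,\vee\, (x_3 \wedge x_4) \]

is consistent with Pos and Neg: each positive assignment satisfies one of the two terms, while neither negative assignment satisfies either term.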

6 Conclusion

A particularly effective use of Inductive Logic Programming has been the construction of new features that augment the existing descriptors of a dataset. Experimental studies reported in the literature have repeatedly shown that relational features constructed by an ILP engine can substantially assist in the analysis of data. Models constructed in this way have addressed both classification and regression, with improvements reported in each case. Practical difficulties remain, however. The rich language of first-order logic used by ILP systems gives rise to a very large space of possible new features, and the resulting computational difficulty of finding interesting features is not easily overcome by the usual ILP-based methods of language bias or constraints. In this paper, we have introduced what appears to be the first use of a distributed algorithm for feature selection in ILP that also has some provable guarantees of convergence. The experimental results we have presented suggest that the algorithm is able to identify good models using significantly fewer computational resources than those needed by a non-distributed approach.

There are a number of ways in which the work here could be extended. Conceptually, we have outlined a conjecture in the previous section that we believe is worth investigating further; if it is proven to hold, this would be a first-of-its-kind result for consensus-based methods. In implementation terms, we are able to extend the approach we have proposed to other kinds of models that use convex loss functions, and to consider a consensus-based version of the SNoW architecture. The latter would give us the ability to partition very large datasets and to deal with very large feature-spaces at the same time. The approach also does not require all computational nodes to draw from the same feature-space (this was a constraint imposed here so that the centralized and distributed models could be evaluated in a controlled manner); it may be both interesting and desirable for nodes to sample from different feature-spaces, or with different support and precision constraints. We note also that the new version of Apache Cassandra uses a peer-to-peer setup, and it would be useful to implement the algorithm we have proposed on a real distributed system built from commodity components. This would allow us to validate, on a real system, the gains in time observed in simulation when using the distributed approach. Experimentally, we recognize that results on more real-world datasets are always desirable: we hope the results here will provide the impetus to explore distributed feature construction by ILP on many more real datasets.