Making sense of sensory input

This paper attempts to answer a central question in unsupervised learning: what does it mean to "make sense" of a sensory sequence? In our formalization, making sense involves constructing a symbolic causal theory that explains the sensory sequence and satisfies a set of unity conditions. This model was inspired by Kant's discussion of the synthetic unity of apperception in the Critique of Pure Reason. On our account, making sense of sensory input is a type of program synthesis, but it is unsupervised program synthesis. Our second contribution is a computer implementation, the Apperception Engine, that was designed to satisfy the above requirements. Our system is able to produce interpretable human-readable causal theories from very small amounts of data, because of the strong inductive bias provided by the Kantian unity constraints. A causal theory produced by our system is able to predict future sensor readings, as well as retrodict earlier readings, and "impute" (fill in the blanks of) missing sensory readings, in any combination. We tested the engine in a diverse variety of domains, including cellular automata, rhythms and simple nursery tunes, multi-modal binding problems, occlusion tasks, and sequence induction IQ tests. In each domain, we test our engine's ability to predict future sensor values, retrodict earlier sensor values, and impute missing sensory data. The Apperception Engine performs well in all these domains, significantly out-performing neural net baselines. We note in particular that in the sequence induction IQ tasks, our system achieved human-level performance. This is notable because our system is not a bespoke system designed specifically to solve IQ tasks, but a general purpose apperception system that was designed to make sense of any sensory sequence.


Introduction
Imagine a machine, equipped with various sensors, that receives a stream of sensory information. It must, somehow, make sense of this stream of sensory data. But what does it mean, exactly, to "make sense" of sensory data? We have an intuitive understanding of what is involved in making sense of the sensory stream - but can we specify precisely what is involved? Can this intuitive notion be formalized?

(3) A causal theory produced by our system can be used to predict future readings, retrodict earlier readings, and impute missing readings of the sensory stream. In fact, our system is able to predict, retrodict, and impute simultaneously. (4) The Apperception Engine has been tested in a diverse variety of domains, with encouraging results. The five domains we use are elementary cellular automata, rhythms and nursery tunes, "Seek Whence" and C-test sequence induction intelligence tests [24], multimodal binding tasks, and occlusion problems. These tasks were chosen because they require cognition rather than mere classificatory perception, and because they are simple for humans but not for modern machine learning systems, e.g. neural networks. The Apperception Engine performs well in all these domains, significantly out-performing neural net baselines. These results are significant because neural systems typically struggle to solve the binding problem (where information from different modalities must somehow be combined into different aspects of one unified object) and fail to solve occlusion tasks (in which objects are sometimes visible and sometimes obscured from view).
We note in particular that in the sequence induction intelligence tests, our system achieved human-level performance. This is notable because the Apperception Engine was not designed to solve these induction tasks; it is not a bespoke hand-engineered solution to this particular domain. Rather, it is a general-purpose system that attempts to make sense of any sensory sequence. This is, we believe, a highly suggestive result [25].
In ablation tests, we tested what happened when each of the four unity conditions was turned off. Since the system's performance deteriorates noticeably when each unity condition is ablated, this indicates that the unity conditions are indeed doing vital work in our engine's attempts to make sense of the incoming barrage of sensory data.

Related work
A human being who has built a mental model of the world can use that model for counterfactual reasoning, anticipation, and planning [26-28]. Similarly, computer agents endowed with mental models are able to achieve impressive performance in a variety of domains. For instance, Lukasz Kaiser et al. [29] show that a model-based reinforcement learning agent trained on 100K interactions compares with a state-of-the-art model-free agent trained on tens or hundreds of millions of interactions. David Silver et al. [30] have shown that a model-based Monte Carlo tree search planner with policy distillation can achieve superhuman-level performance in a number of board games. The tree search relies, crucially, on an accurate model of the game dynamics.
When we have an accurate model of the environment, we can leverage that model to anticipate and plan. But in many domains, we do not have an accurate model. If we want to apply model-based methods in these domains, we must learn a model from the stream of observations. In the rest of this section, we shall describe various different approaches to representing and learning models, and show where our particular approach fits into the landscape of model learning systems.
Before we start to build a model to explain a sensory sequence, one fundamental question is: what form should the model take? We shall distinguish three dimensions of variation of models (adapted from [31]): first, whether they simply model the observed phenomena, or whether they also model latent structure; second, whether the model is explicit and symbolic or implicit; and third, what type of prior knowledge is built into the model structure.
We shall use the hidden Markov model (HMM) [32,33] as a general framework for describing sequential processes. Here, the observation at time t is x_t, and the latent state is z_t. In an HMM, the observation x_t at time t depends only on the latent (unobserved) state z_t. The state z_t in turn depends only on the previous latent state z_{t-1}.
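This dependency structure can be sketched in a few lines of Python. The transition and emission functions below are toy stand-ins chosen for illustration, not anything from the paper's formalism; the point is only that x_t is a function of z_t alone, and z_t of z_{t-1} alone.

```python
# Minimal sketch of the HMM dependency structure: the latent state
# evolves on its own, and each observation depends only on the
# current latent state.

def rollout(z1, transition, emit, steps):
    """Unroll a hidden-state process: z_t depends only on z_{t-1},
    and the observation x_t depends only on z_t."""
    zs, xs = [z1], [emit(z1)]
    for _ in range(steps - 1):
        zs.append(transition(zs[-1]))
        xs.append(emit(zs[-1]))
    return zs, xs

# Toy example: the latent state counts mod 3; we observe only its parity.
zs, xs = rollout(0, transition=lambda z: (z + 1) % 3,
                 emit=lambda z: z % 2, steps=6)
```

Note that distinct latent states (here 0 and 2) can produce the same observation, which is why the latent state cannot in general be read off from the observations.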
The first dimension of variation amongst models is whether they actually use latent state information z_t to explain the observation x_t. Some approaches [34-39] assume we are given the underlying state information z_{1:t}. In these approaches, there is no distinction between the observed phenomena and the latent state: x_i = z_i. With this simplifying assumption, the only thing a model needs to learn is the transition function. Other approaches [40,2,41] focus only on the observed phenomena x_{1:t} and ignore latent information z_{1:t} altogether. These approaches predict observation x_{t+1} given observation x_t without positing any hidden latent structure. Some approaches take latent information seriously [42-44,5,45]. These jointly learn a perception function (that produces a latent z_t from an observed x_t), a transition function (producing a next latent state z_{t+1} from latent state z_t), and a rendering function (producing a predicted observation x_{t+1} from the latent state z_{t+1}). Our approach also builds a latent representation of the state. As well as positing latent properties (unobserved properties that explain observed phenomena), we also posit latent objects (unobserved objects whose relations to observed objects explain observed phenomena).
The second dimension of variation concerns whether the learned model is explicit, symbolic, and human-readable, or implicit and inscrutable. In some approaches [42-44,5], the latent states are represented by vectors and the dynamics of the model by weight tensors. In these cases, it is hard to understand what the system has learned. In other approaches [46-49], the latent state is represented symbolically, but the state transition function is represented by the weight tensor of a neural network and is inscrutable. We may have some understanding of what state the machine thinks it is in, but we do not understand why it thinks there is a transition from this state to that. In some approaches [11,12,50-53], both the latent state and the state transition function are represented symbolically. Here, the latent state is a set of ground atoms, and the state transition function is represented by a set of universally quantified rules. Our approach falls into this third category. Here, the model is fully interpretable: we can interpret the state the machine thinks it is in, and we can understand the reason why it believes it will transition to the next state.
A third dimension of variation between models is the amount and type of prior knowledge that they include. Some model learning systems have very little prior knowledge. In some of the neural systems (e.g. [2]), the only prior knowledge is the spatial invariance assumption implicit in the convolutional network's structure. Other models incorporate prior knowledge about the way objects and states should be represented. For example, some models assume objects can be composed in hierarchical structures [47]. Other systems additionally incorporate prior knowledge about the type of rules that are used to define the state transition function. For example, some [51-53] use prior knowledge of the event calculus [54]. Our approach falls into this third category. We impose a language bias in the form of rules used to define the state transition function and also impose additional requirements on candidate sets of rules: they must satisfy the four unity conditions introduced above (and elaborated in Section 3.3 below).
To summarize, in order to position our approach within the landscape of other approaches, we have distinguished three dimensions of variation. Our approach differs from neural approaches in that the posited theory is explicit and human-readable. Not only is the representation of state explicit (represented as a set of ground atoms) but the transition dynamics of the system are also explicit (represented as universally quantified rules in a domain-specific language designed for describing causal structures). Our approach differs from other inductive program synthesis methods in that it posits significant latent structure in addition to the induced rules to explain the observed phenomena: in our approach, explaining a sensory sequence does not just mean constructing a set of rules that explain the transitions; it also involves positing a type signature containing a set of latent properties and a set of latent objects. Our approach also differs from other inductive program synthesis methods in the type of prior knowledge that is used: as well as providing a strong language bias by using a particular representation language (a typed extension of Datalog with causal rules and constraints), we also inject a substantial inductive bias: the unity conditions, the key constraints on our system, represent domain-independent prior knowledge. Our approach also differs from other inductive program synthesis methods in being entirely unsupervised. In contrast, OSLA and OLED [51,52] are supervised, and SPLICE [53] is semi-supervised. See Section 7 for detailed discussion.

Paper outline
Section 2 introduces basic notation. Section 3 presents the main definition of what it means for a theory to count as a unified interpretation of a sensory sequence. Section 4 describes a computer system that is able to generate unified interpretations of sensory sequences. Section 5 describes our experiments in five different types of task: elementary cellular automata, rhythms and nursery tunes, "Seek Whence" sequence induction tasks, multi-modal binding tasks, and occlusion problems. In Section 6, we show how our system is extended to robustly handle noise. Related work is discussed in Section 7.

Background
In this paper, we use basic concepts and standard notation from logic programming [55-57]. A function-free atom is an expression of the form p(t_1, ..., t_n), where p is a predicate of arity n ≥ 0 and each t_i is either a variable or a constant. We shall use a, b, c, ... for constants, X, Y, Z, ... for variables, and p, q, r, ... for predicate symbols.
A substitution σ is a mapping from variables to terms. For example, σ = {X/a, Y/b} replaces variable X with constant a and replaces variable Y with constant b. We write ασ for the application of substitution σ to atom α, so e.g. p(X, Y)σ = p(a, b).
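Substitution application is simple to sketch in Python. The encoding below, atoms as tuples whose first element is the predicate name, upper-case strings for variables and lower-case for constants, is an illustrative convention, not the paper's representation.

```python
# Sketch of substitution application. An atom is a tuple
# (predicate, term, ..., term); a substitution is a dict from
# variable names to terms. Terms not in the substitution's domain
# (i.e. constants) are left unchanged.

def apply_subst(atom, sigma):
    """Apply substitution sigma to every term of the atom."""
    pred, *terms = atom
    return (pred, *[sigma.get(t, t) for t in terms])

sigma = {"X": "a", "Y": "b"}
result = apply_subst(("p", "X", "Y"), sigma)   # p(X, Y){X/a, Y/b}
```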
A Datalog clause is a definite clause of the form α_1 ∧ ... ∧ α_n → α_0, where each α_i is an atom and n ≥ 0. It is traditional to write clauses from right to left: α_0 ← α_1, ..., α_n. In this paper, we will define a Datalog interpreter implemented in another logic programming language, ASP (answer-set programming). In order to keep the two languages distinct, we write Datalog rules from left to right and ASP clauses from right to left. A Datalog program is a set of Datalog clauses.
A key result of logic programming is that every Datalog program has a unique subset-minimal least Herbrand model, which can be computed directly by repeatedly generating the consequences of the ground instances of the clauses [58].
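The least-model computation can be sketched as naive forward chaining. The sketch below assumes the rules have already been grounded, with each rule a (body, head) pair of atom tuples; this is a simplification of a real Datalog interpreter, which would ground rules against the growing model.

```python
# Sketch of the least Herbrand model of a Datalog program, computed
# by repeatedly adding the heads of ground rule instances whose
# bodies are already satisfied, until a fixpoint is reached.

def least_model(ground_rules, facts):
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in ground_rules:
            if set(body) <= model and head not in model:
                model.add(head)
                changed = True
    return model

rules = [([("p", "a")], ("q", "a")),   # p(a) -> q(a)
         ([("q", "a")], ("r", "a"))]   # q(a) -> r(a)
model = least_model(rules, {("p", "a")})
```

The loop terminates because the Herbrand base is finite and the model only grows; the result is the least model because nothing is added unless forced by a rule.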
We turn now from Datalog to normal logic programs under the answer set semantics [59]. A literal is an atom α or a negated atom not α. A normal logic program is a set of clauses of the form α_0 ← α_1, ..., α_n, where α_0 is an atom, α_1, ..., α_n is a conjunction of literals, and n ≥ 0. Normal logic clauses extend Datalog clauses by allowing functions in terms and by allowing negation by failure in the body of the rule.
Answer Set Programming (ASP) is a logic programming language based on normal logic programs under the answer set semantics. Given a normal logic program, an ASP solver finds the set of answer sets for that program. Modern ASP solvers can also be used to solve optimization problems by the introduction of weak constraints [60]. A weak constraint is a rule that defines the cost of a certain tuple of atoms. Given a program with weak constraints, an ASP solver can find a preferred answer set with the lowest cost.

A computational framework for making sense of sensory sequences
What does it mean to make sense of a sensory sequence? In this section, we formalize what this means, before describing our computer implementation. We assume that the sensor readings have already been discretized into ground atoms of first-order logic, so a sensory reading featuring sensor a can be represented by a ground atom p(a) for some unary predicate p, or by an atom r(a, b) for some binary relation r and unique value b.

Definition 1. An unambiguous symbolic sensory sequence is a sequence of sets of ground atoms. Given a sequence S = (S_1, S_2, ...), every state S_t in S is a set of ground atoms, representing a partial description of the world at a discrete time step t. An atom p(a) ∈ S_t represents that sensor a has property p at time t. An atom r(a, b) ∈ S_t represents that sensor a is related via relation r to value b at time t. If G is the set of all ground atoms, then S ∈ (2^G)*.
Example 1. Consider the following sequence S_{1:10}. Here there are two sensors, a and b, and each sensor can be either on or off. There is no expectation that a sensory sequence contains readings for all sensors at all time steps. Some of the readings may be missing. In state S_5, we are missing a reading for a, while in state S_9, we are missing a reading for b. In states S_1 and S_10, we are missing sensor readings for both a and b.
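Such a sequence is naturally encoded as a list of sets of atom tuples, where a missing reading is simply an absent atom. The concrete on/off values of Example 1 are not reproduced in this chunk, so the pattern below is illustrative only; the gaps, however, match the text: no reading for a at step 5, none for b at step 9, and none at all at steps 1 and 10.

```python
# A symbolic sensory sequence as a list of sets of ground atoms.
# The on/off values are illustrative placeholders; the missing
# readings follow Example 1.

seq = [
    set(),                               # S_1: both readings missing
    {("on", "a"), ("off", "b")},         # S_2
    {("off", "a"), ("on", "b")},         # S_3
    {("on", "a"), ("off", "b")},         # S_4
    {("on", "b")},                       # S_5: reading for a missing
    {("on", "a"), ("off", "b")},         # S_6
    {("off", "a"), ("on", "b")},         # S_7
    {("on", "a"), ("off", "b")},         # S_8
    {("off", "a")},                      # S_9: reading for b missing
    set(),                               # S_10: both readings missing
]

def sensors_read(state):
    """The set of sensors for which this state contains a reading."""
    return {atom[1] for atom in state}
```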
The central idea is to make sense of a sensory sequence by constructing a unified theory that explains that sequence. The key notions here are "theory", "explains", and "unified". We consider each in turn.

The theory
Theories are defined in a new language, Datalog⊃-, designed for modelling dynamics. In this language, one can describe how facts change over time by writing a causal rule stating that if the antecedent holds at the current time-step, then the consequent holds at the next time-step. Additionally, our language includes a frame axiom allowing facts to persist over time: each atom remains true at the next time-step unless it is overridden by a new fact which is incompatible with it. Two facts are incompatible if there is a constraint that precludes them from both being true. Thus, Datalog⊃- extends Datalog with causal rules and constraints.

Definition 2.
A theory is a four-tuple (φ, I, R, C) of Datalog⊃- elements where:
• φ is a type signature specifying the types of constants, variables, and arguments of predicates
• I is a set of initial conditions: ground atoms true at the first time step
• R is a set of rules defining the dynamics
• C is a set of constraints

We shall consider each element in turn, starting with the type signature.

Definition 3. Given a set 𝒯 of types, a set 𝒪 of constants representing individual objects, and a set 𝒫 of predicates representing properties and relations, let G be the set of all ground atoms formed from 𝒯, 𝒪, and 𝒫. Given a set 𝒱 of variables, let U be the set of all unground atoms formed from 𝒯, 𝒱, and 𝒫.
A type signature is a tuple (T, O, P, V) where T ⊆ 𝒯 is a finite set of types, O ⊆ 𝒪 is a finite set of constants representing objects, P ⊆ 𝒫 is a finite set of predicates representing properties and relations, and V ⊆ 𝒱 is a finite set of variables.
We write κ_O : O → T for the type of an object, κ_P : P → T* for the types of the predicate's arguments, and κ_V : V → T for the type of a variable. Now some type signatures are suitable for some sensory sequences, while others are unsuitable, because they do not contain the right constants and predicates. The following definition formalizes this:

Definition 4. Let G_S = ∪_{t≥1} S_t be the set of all ground atoms that appear in sensory sequence S = (S_1, ...). Let G_φ be the set of all ground atoms that are well-typed according to type signature φ:

G_φ = {p(a_1, ..., a_n) ∈ G | p ∈ P, κ_P(p) = (t_1, ..., t_n), a_i ∈ O, κ_O(a_i) = t_i for all i = 1..n}.

A type signature φ is suitable for a sensory sequence S if all the atoms in S are well-typed according to signature φ, i.e. G_S ⊆ G_φ.
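The suitability check is straightforward to sketch. Below, the type maps κ_O and κ_P are modelled as plain dicts, and the declared types are merely illustrative; the check mirrors the definition: every atom occurring in the sequence must be well-typed.

```python
# Sketch of the suitability check of Definition 4. An atom is
# well-typed if its predicate is declared, its arity matches, and
# each argument's object type matches the predicate's argument type.

kappa_O = {"a": "s", "b": "s"}                   # object -> type
kappa_P = {"on": ("s",), "off": ("s",)}          # predicate -> arg types

def well_typed(atom):
    pred, *args = atom
    if pred not in kappa_P or len(args) != len(kappa_P[pred]):
        return False
    return all(kappa_O.get(arg) == t for arg, t in zip(args, kappa_P[pred]))

def suitable(seq):
    """A signature is suitable for seq if every atom in seq is well-typed."""
    return all(well_typed(atom) for state in seq for atom in state)
```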
Next, we define the set of unground atoms for a particular type signature.

Definition 5. Let U_φ be the set of all unground atoms that are well-typed according to signature φ:

U_φ = {p(V_1, ..., V_n) ∈ U | p ∈ P, κ_P(p) = (t_1, ..., t_n), V_i ∈ V, κ_V(V_i) = t_i for all i = 1..n}.

Note that, according to this definition, an atom is unground only if all its terms are variables. "Unground" thus means more than simply not ground. For example, p(a, X) is neither ground nor unground.
Example 2. One suitable type signature for the sequence of Example 1 is (T, O, P, V), consisting of types T = {s}, objects O = {a:s, b:s}, predicates P = {on(s), off(s)}, and variables V = {X:s, Y:s}. Here, and throughout, we write a:s to mean that object a is of type s, on(s) to mean that unary predicate on takes one argument of type s, and X:s to mean that variable X is of type s. The unground atoms are U_φ = {on(X), off(X), on(Y), off(Y)}. There are, of course, an infinite number of other suitable signatures.

Definition 6. The initial conditions I of a theory (φ, I, R, C) is a set of ground atoms from G_φ representing a partial description of the facts true at the initial time step.
The initial conditions are needed to specify the initial values of the latent unobserved information. Some systems (e.g. LFIT [12]) define a predictive model without using a set I of initial conditions. These systems are able to avoid positing initial conditions because they do not use latent unobserved information. But any system that does invoke latent information beneath the surface of the sensory stimulations must also define the initial values of the latent information.
The rules define the dynamics of the theory:

Definition 7. There are two types of rule in Datalog⊃-. A static rule is a definite clause of the form α_1 ∧ ... ∧ α_n → α_0, where n ≥ 0 and each α_i is an unground atom from U_φ consisting of a predicate and a list of variables. Informally, a static rule is interpreted as: if conditions α_1, ..., α_n hold at the current time step, then α_0 also holds at that time step. A causal rule is a clause of the form α_1 ∧ ... ∧ α_n ⊃- α_0, where n ≥ 0 and each α_i is an unground atom from U_φ. A causal rule expresses how facts change over time. The rule α_1 ∧ ... ∧ α_n ⊃- α_0 states that if conditions α_1, ..., α_n hold at the current time step, then α_0 holds at the next time step.
All variables in rules are implicitly universally quantified. So, for example, on(X) ⊃- off(X) states that for all objects X, if X is currently on, then X will become off at the next time-step. The constraints rule out certain combinations of atoms:

Definition 8. There are three types of constraint in Datalog⊃-. A unary constraint is an expression of the form ∀X:t, p_1(X) ⊕ ... ⊕ p_n(X), where n > 1, meaning that for all X of type t, exactly one of p_1(X), ..., p_n(X) holds. A binary constraint is an expression of the form ∀X:t_1, ∀Y:t_2, r_1(X, Y) ⊕ ... ⊕ r_n(X, Y), where n > 1, meaning that for all objects X and Y, exactly one of the binary relations holds. A uniqueness constraint is an expression of the form ∀X:t_1, ∃!Y:t_2, r(X, Y), which means that for all objects X of type t_1 there exists a unique object Y of type t_2 such that r(X, Y).
Note that the rules and constraints are constructed entirely from unground atoms. Disallowing constants prevents special-case rules that apply to particular objects, and forces the theory to be general.

Explaining the sensory sequence
A theory explains a sensory sequence if the theory generates a trace that covers that sequence. In this section, we explain the trace and the covering relation.

Definition 9.
Every theory θ = (φ, I, R, C) generates an infinite sequence τ(θ) of sets of ground atoms, called the trace of that theory. Here, τ(θ) = (A_1, A_2, ...), where each A_t is the smallest set of ground atoms satisfying the following conditions:
• the initial conditions hold at the first time step: I ⊆ A_1;
• if R contains a static rule α_1 ∧ ... ∧ α_n → α_0, and σ is a substitution such that α_1σ, ..., α_nσ are all in A_t, then α_0σ ∈ A_t;
• if R contains a causal rule α_1 ∧ ... ∧ α_n ⊃- α_0, and σ is a substitution such that α_1σ, ..., α_nσ are all in A_t, then α_0σ ∈ A_{t+1};
• (frame axiom) if α ∈ A_{t-1}, and there is no atom in A_t that is incompossible with α w.r.t. constraints C, then α ∈ A_t.
Two ground atoms are incompossible if there is some constraint c in C and some substitution σ such that the ground constraint cσ precludes both atoms being true.
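Incompossibility under a unary xor constraint can be sketched as follows. Constraints are modelled here simply as sets of mutually exclusive predicate names, an illustrative simplification of the typed constraints of Definition 8.

```python
# Sketch of incompossibility: two ground atoms are incompossible if
# some xor constraint forbids them from both being true of the same
# argument tuple.

xor_constraints = [{"on", "off"}, {"p1", "p2", "p3"}]

def incompossible(atom1, atom2):
    (p, *args1), (q, *args2) = atom1, atom2
    if args1 != args2 or p == q:
        return False
    return any(p in c and q in c for c in xor_constraints)
```

For instance, on(a) and off(a) are incompossible under the first constraint, while on(a) and off(b) are not, since they concern different objects.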
The frame axiom is a simple way of providing inertia: a proposition continues to remain true until something new comes along which is incompatible with it.Including the frame axiom makes our theories much more concise: instead of needing rules to specify all the atoms which remain the same, we only need rules that specify the atoms that change.
Note that the state transition function is deterministic: A_t is uniquely determined by A_{t-1}.
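One step of trace generation can be sketched as below. The sketch assumes pre-grounded causal rules (each a (body, head) pair of ground atoms) and omits static rules and variable binding; the rule encoding and the clash check are illustrative, not the engine's implementation.

```python
# Sketch of one step of trace generation (Definition 9): apply the
# causal rules whose bodies hold, then carry over every old atom that
# is not overridden by an incompossible new fact (the frame axiom).

def step(state, causal_rules, incompossible):
    # Atoms produced by causal rules whose bodies hold now.
    new = {head for body, head in causal_rules if set(body) <= state}
    # Frame axiom: old atoms persist unless overridden by a new fact.
    inertia = {a for a in state
               if not any(incompossible(a, b) for b in new)}
    return new | inertia

rules = [([("on", "a")], ("off", "a")),    # on(a) >- off(a)
         ([("off", "a")], ("on", "a"))]    # off(a) >- on(a)
clash = lambda x, y: x[1:] == y[1:] and {x[0], y[0]} == {"on", "off"}

s = {("on", "a"), ("wet", "a")}
s = step(s, rules, clash)
```

Note how the unconstrained atom wet(a) persists by inertia while on(a) is overridden by the incompossible new fact off(a).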
Theorem 1. The trace of every theory repeats after some finite number of steps: for any theory θ with τ(θ) = (A_1, A_2, ...), there exists a k > 1 such that A_{k+i} = A_{1+i} for all i ≥ 0.

Proof. Since the set G_φ of ground atoms is finite, there must be a k such that A_1 = A_k. The proof proceeds by induction on i. If i = 0, the proof is trivial. When i > 0, note that the trace function τ satisfies the Markov condition that the next state A_{t+1} depends only on the current state A_t, and not on any earlier states. Hence if A_{k+i-1} = A_{i}, then A_{k+i} = A_{1+i}.

One important consequence of Theorem 1 is:

Theorem 2. Given a theory θ and a ground atom α, it is decidable whether α appears somewhere in the infinite trace τ(θ).
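The two theorems can be illustrated computationally: because the state space is finite and the transition deterministic, iterating the transition must revisit a state, and once it does, every atom that will ever appear in the trace has already appeared. The dynamics below are a toy stand-in for an actual theory's transition function.

```python
# Sketch of Theorems 1 and 2: collect all atoms appearing in the
# (infinite) trace by iterating the deterministic transition until a
# state repeats. Membership in the trace is then decidable by a
# lookup in the finite set of collected atoms.

def atoms_in_trace(a1, step):
    seen_states, seen_atoms, state = set(), set(), frozenset(a1)
    while state not in seen_states:
        seen_states.add(state)
        seen_atoms |= state
        state = frozenset(step(state))
    return seen_atoms

# Toy deterministic dynamics on a two-valued sensor:
flip = {("on", "a"): ("off", "a"), ("off", "a"): ("on", "a")}
atoms = atoms_in_trace({("on", "a")}, lambda s: {flip[x] for x in s})
```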
Next we define what it means for a theory to "explain" a sensory sequence. In providing a theory θ that explains a sensory sequence S, we make S intelligible by placing it within a bigger picture: while S is a scanty and incomplete description of a fragment of the time-series, τ(θ) is a complete and determinate description of the whole time-series.
Example 3. We shall provide a theory to explain the sensory sequence S of Example 1.
Consider the type signature φ = (T, O, P, V), consisting of types T = {s}, objects O = {a:s, b:s}, predicates P = {on(s), off(s), p_1(s), p_2(s), p_3(s), r(s, s)}, and variables V = {X:s, Y:s}. Here, φ extends the type signature of Example 2 by adding three unary predicates p_1, p_2, p_3, and one binary relation r. (Extended type signatures like this one are generated by the machine, not by hand: our computer implementation searches through the space of increasingly complex type signatures extending the original signature. This search process is described in Section 4.1.) Consider the theory θ = (φ, I, R, C), where:
The infinite trace τ(θ) = (A_1, A_2, ...) for theory θ repeats at step 4. In fact, it is always true that the trace repeats after some finite number of time steps.
Theory θ explains the sensory sequence S of Example 1, since the trace τ(θ) covers S. Note that τ(θ) "fills in the blanks" in the original sequence S: predicting final time step 10, retrodicting initial time step 1, and imputing missing values for time steps 5 and 9.

Unifying the sensory sequence
Next, we proceed from explaining a sensory sequence to "making sense" of that sequence. In order for θ to make sense of S, it is necessary that τ(θ) covers S. But this condition is not, on its own, sufficient. The extra condition that is needed for θ to count as "making sense" of S is for θ to be unified. We require that the constituents of the theory are integrated into a coherent whole. A trace τ(θ) of theory θ is a (i) sequence of (ii) sets of ground atoms composed of (iii) predicates and (iv) objects. For the theory θ to be unified is for unity to be achieved at each of these four levels. The first condition, spatial unity, requires that in every state, every pair of objects is connected via some chain of binary atoms. If this condition is satisfied, it means that given any object, we can get to any other object by hopping along relations. Everything is connected, even if only indirectly.
Note that this notion of spatial unity is rather abstract: the requirement is only that every pair of objects are indirectly connected via some chain of binary relations. Although some of these binary relations might be spatial relations (e.g. "left-of"), they need not all be. The requirement is only that every pair of objects are connected via some chain of binary relations; it does not insist that each binary relation has a specifically "spatial" interpretation.
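The spatial unity check for a single state amounts to a connectivity test on the graph whose edges are the state's binary atoms; a breadth-first search suffices. The encoding below (atoms as tuples, binary atoms of length 3) is an illustrative convention, and it assumes the atoms only mention declared objects.

```python
# Sketch of the spatial unity check for one state: every pair of
# objects must be connected, possibly indirectly, via binary atoms.

from collections import deque

def spatially_unified(objects, state):
    edges = {o: set() for o in objects}
    for atom in state:
        if len(atom) == 3:                 # binary atom (pred, x, y)
            _, x, y = atom
            edges[x].add(y)
            edges[y].add(x)
    # BFS from an arbitrary object; all objects must be reachable.
    start = next(iter(objects))
    seen, queue = {start}, deque([start])
    while queue:
        for n in edges[queue.popleft()] - seen:
            seen.add(n)
            queue.append(n)
    return seen == set(objects)
```

For the full unity check, this test must hold in every state of the (finitely checkable) trace.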

Conceptual unity
A theory satisfies conceptual unity if every predicate is involved in some constraint, either exclusive disjunction (⊕) or unique existence (∃!). The intuition here is that constraints combine predicates into clusters of mutual incompatibility.

Definition 13. A theory θ = (φ, I, R, C) satisfies conceptual unity if for each unary predicate p in φ, there is some xor constraint in C of the form ∀X:t, p(X) ⊕ q(X) ⊕ ... containing p; and, for each binary predicate r in φ, there is either some xor constraint in C of the form ∀X:t_1, ∀Y:t_2, r(X, Y) ⊕ r'(X, Y) ⊕ ... containing r, or some uniqueness constraint in C of the form ∀X:t_1, ∃!Y:t_2, r(X, Y).

To see the importance of this, observe that if there are no constraints, then there are no exhaustiveness or exclusiveness relations between atoms. An xor constraint, e.g. ∀X:t, on(X) ⊕ off(X), both rules out the possibility that an object is simultaneously on and off (exclusiveness) and also rules out the possibility that an object of type t is neither on nor off (exhaustiveness). It is exhaustiveness which generates states that are determinate, in which it is guaranteed that every object of type t is, e.g., either on or off. It is exclusiveness which generates incompossibility between atoms, e.g. that on(a) and off(a) are incompossible. Incompossibility, in turn, is needed to constrain the scope of the frame axiom (see Definition 9 above). Without incompossibility, all atoms from the previous time-step would be transferred to the next time-step, and the set of true atoms in the trace (A_1, A_2, ...) would grow monotonically over time: A_i ⊆ A_j if i ≤ j, which is clearly unacceptable.
The purpose of the constraint of conceptual unity is to collect predicates into groups, to provide determinacy in each state, and to ground the incompossibility relation that constrains the way information persists between states.
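The conceptual unity check itself is a simple coverage test: every predicate of the signature must be mentioned by some constraint. Constraints are modelled below merely as the sets of predicate names they mention, an illustrative reduction of Definition 13.

```python
# Sketch of the conceptual unity check: every predicate of the
# signature occurs in some constraint (xor or uniqueness).

def conceptually_unified(predicates, constraints):
    constrained = set().union(*constraints) if constraints else set()
    return set(predicates) <= constrained

constraints = [{"on", "off"},      # forall X:s, on(X) xor off(X)
               {"r"}]              # forall X:s, exists! Y:s, r(X, Y)
```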

Static unity
In our effort to interpret the sensory sequence, we construct various ground atoms. These need to be grouped together, somehow, into states (sets of atoms). But what determines how these atoms are grouped together into states?
Treating a set A of ground atoms as a state is (i) to insist that A satisfies all the constraints in C and (ii) to insist that A is closed under the static rules in R:

Definition 14. A theory θ = (φ, I, R, C) satisfies static unity if every state in τ(θ) = (A_1, A_2, ...) satisfies all the constraints in C and is closed under the static rules in R.
Static unity is an uncontroversial requirement and is used in other ILP systems [67,68]. Note that, from the definition of the trace in Definition 9, all the states in τ(θ) are automatically closed under the static rules in R.

Temporal unity
Given a set of states, we need to unite these elements in a sequence. According to the fourth and final condition of unity, the only thing that can unite states in a sequence is a set of causal rules. These causal rules are universal in two senses: they apply to all object tuples, and they apply at all times. A causal rule α_1 ∧ ... ∧ α_n ⊃- α_0 fixes the temporal relation between the atoms α_1, ..., α_n (which are true at t) and the atom α_0 (which is true at t + 1):

Definition 15. A sequence (A_1, A_2, ...) of states satisfies temporal unity with respect to a set R⊃- of causal rules if, for each causal rule α_1 ∧ ... ∧ α_n ⊃- α_0 in R⊃- and each substitution σ, whenever α_1σ, ..., α_nσ are all in A_t, then α_0σ ∈ A_{t+1}.

Temporal unity is an uncontroversial requirement and is also used in other ILP systems such as LFIT [12]. Note that, from the definition of the trace in Definition 9, the trace τ(θ) automatically satisfies temporal unity.

The four conditions of unity
To recap, the trace of a theory is a sequence of sets of atoms. The four types of element are objects, predicates, sets of atoms, and sequences of sets of atoms. Each of the four types of element has its own form of unity:
1. Spatial unity: objects are united in space by being connected via chains of relations.
2. Conceptual unity: predicates are united by constraints.
3. Static unity: atoms are united in a state by jointly satisfying constraints and static rules.
4. Temporal unity: states are united in a sequence by causal rules.
Since temporal unity is automatically satisfied from the definition of a trace in Definition 9, we are left with only three unity conditions that need to be explicitly checked: spatial unity, conceptual unity, and static unity. A trace partially satisfies static unity, since the static rules are automatically enforced by Definition 9; but the constraints are not necessarily satisfied.
Note that checking spatial unity and checking static unity both require checking every time-step, and the trace is infinitely long. However, Theorem 1 ensures that the trace repeats, so we need only check the finite portion of the trace up to the first repetition (the first k such that A_1 = A_k, where τ(θ) = (A_1, ...)).

Making sense
Now we are ready to define the central notion of "making sense" of a sequence.

Definition 16. A theory θ makes sense of a sensory sequence S if θ explains S, i.e. S ⊑ τ(θ), and θ satisfies the four conditions of unity of Definition 11. If θ makes sense of S, we also say that θ is a unified interpretation of S.

One might wonder why we did not instead express each xor constraint using normal clauses: definite clauses representing exclusiveness, and one normal clause representing their exhaustiveness. The main reason we chose to add exclusive disjunction as a first-class language construct in Datalog⊃-, rather than adding negation as failure, is that it means we can avoid the complexities involved in the semantics if we added negation as failure to our target language. There are various semantics for normal logic programs that include negation as failure (e.g. Clark completion [65], stable model semantics [59], well-founded models [66]), but each of them introduces significant additional complexities when compared with the least model of a definite logic program: the Clark completion is not always consistent (does not always have a model), the stable model semantics assigns the meaning of a normal logic program to a set of models rather than a single model, and the well-founded model uses a 3-valued logic where atoms can be true, false, or undefined. Thus, the main reason for expressing constraints using exclusive disjunction (rather than using negation as failure) is to restrict the rules to definite rules and avoid the complexities of the various semantics of normal logic programs.
2. The predicates of θ are on, off, p_1, p_2, p_3, r. Here, on and off are involved in the constraint ∀X:s, on(X) ⊕ off(X), while p_1, p_2, p_3 are involved in the constraint ∀X:s, p_1(X) ⊕ p_2(X) ⊕ p_3(X), and r is involved in the constraint ∀X:s, ∃!Y:s, r(X, Y). Hence, θ makes sense of sensory sequence S of Example 1, since S ⊑ τ(θ) (Example 3) and θ also satisfies the four conditions of unity.

In our search for interpretations that make sense of sensory sequences, we are particularly interested in parsimonious interpretations. To this end, we define the cost of a theory (Definition 17): cost(θ) is the total number of ground atoms in I plus the total number of unground atoms in the rules of R.
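The cost measure can be computed directly from a theory's components. The sketch below is illustrative only (the theory representation and the mini-theory are hypothetical; the real engine minimizes this quantity inside the ASP solver):

```python
def cost(theory):
    """Cost of a theory (phi, I, R, C): the number of ground atoms in the
    initial conditions I plus the number of unground atoms appearing in
    the rules R (each rule contributes its body atoms plus its head)."""
    _phi, init, rules, _constraints = theory
    return len(init) + sum(len(body) + 1 for body, _head in rules)

# Hypothetical mini-theory: I = {on(a), p1(a)}, R = {p1(X) -> on(X)}.
theta = (None, {"on(a)", "p1(a)"}, [({"p1(X)"}, "on(X)")], set())
assert cost(theta) == 4   # 2 initial atoms + 2 atoms in the single rule
```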
The key notion of this section is the discrete apperception task.
Definition 18. The input to an apperception task is a triple (S, φ, C) consisting of a sensory sequence S, a suitable type signature φ, and a set C of (well-typed) constraints such that (i) each predicate in S appears in some constraint in C, and (ii) S can be extended to satisfy C: there exists a sequence S′ covering S such that each state in S′ satisfies each constraint in C.
Given such an input triple (S, φ, C), the discrete apperception task is to find the lowest-cost theory θ = (φ′, I, R, C′) such that φ′ extends φ, C′ ⊇ C, and θ makes sense of S.
Note that the input to an apperception task is more than just a sensory sequence S: it also contains a type signature φ and a set C of constraints. A natural question at this point is: why not simply let the input to an apperception task be just the sequence S, and ask the system to produce some theory θ satisfying the unity conditions such that S ⊑ τ(θ)? The reason that the input needs to contain types φ and constraints C to supplement S is that otherwise the task is severely under-constrained, as the following example shows.
Example 5. Suppose our sequence is S = ({on(a)}, {off(a)}, {on(a)}, {off(a)}, {on(a)}, {off(a)}). If we are not given any constraints (such as ∀X:t, on(X) ⊕ off(X)), and we are free to construct any φ and any set C of constraints, then an interpretation θ = (φ, I, R, C) will suffice in which φ = (T, O, P, V) consists of types T = {t}, objects O = {a:t}, predicates P = {on(t), off(t), p(t), q(t)}, and variables V = {X:t}, and in which we introduce two latent predicates p and q that are incompatible with on and off respectively. But in this interpretation, on and off are not incompatible with each other, so the degenerate interpretation (in which both on and off are true at all times) is acceptable. This shows the need for including constraints on the input predicates as part of the task formulation.
More generally, for any sensory sequence (S_1, ..., S_T) featuring predicates p_1, ..., p_n, but no constraints between p_1, ..., p_n, we can always construct a degenerate interpretation by adding new predicates q_1, ..., q_n with an xor constraint ∀X:t, p_i(X) ⊕ q_i(X) between each predicate p_i and the corresponding new predicate q_i. In the degenerate interpretation, the initial conditions I are S_1 ∪ ... ∪ S_T, and the rules R are empty. This shows that, without constraints on the predicates appearing in the initial sequence, the problem is underspecified.
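The degenerate construction above is mechanical, which a short sketch makes vivid. This is a toy Python rendering (string-encoded atoms and constraint names are illustrative, not the paper's notation):

```python
def degenerate_interpretation(states):
    """Given a sensory sequence (a list of sets of ground atoms 'p(x)')
    with no constraints over its predicates, build the degenerate
    interpretation: invent q_p for each predicate p, add the xor
    constraint p(X) (+) q_p(X), put every observed atom in I, and use
    no rules at all."""
    preds = {atom.split("(")[0] for state in states for atom in state}
    constraints = {f"forall X:t, {p}(X) xor q_{p}(X)" for p in preds}
    init = set().union(*states)          # I = S_1 u ... u S_T
    rules = []                           # R is empty
    return init, rules, constraints

S = [{"on(a)"}, {"off(a)"}, {"on(a)"}]
I, R, C = degenerate_interpretation(S)
assert I == {"on(a)", "off(a)"}
assert R == []
```

The interpretation "explains" the sequence only because nothing forbids on(a) and off(a) from holding simultaneously, which is exactly why the task input must supply constraints over the sensory predicates.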
The apperception task can be generalized to the case where we are given as input, not a single sensory sequence S, but a set of m such sequences.
Definition 19. Given a set {S_1, ..., S_m} of sensory sequences, a type signature φ, and constraints C such that each (S_i, φ, C) is a valid input to an apperception task as defined in Definition 18, the generalized apperception task is to find a lowest-cost theory (φ′, {}, R, C′) and sets {I_1, ..., I_m} of initial conditions such that φ′ extends φ, C′ ⊇ C, and for each i = 1..m, (φ′, I_i, R, C′) makes sense of S_i.

Examples
In this section, we provide a worked example of an apperception task, along with different unified interpretations. We wish to highlight that there are always many alternative ways of interpreting a sensory sequence, each with different latent information (although some may have higher cost than others).
We continue to use our running example, the sensory sequence from Example 1. Here there are two sensors a and b, and each sensor can be on or off.
Examples 6, 7, and 8 below show three different unified interpretations of Example 1.
Example 6. One possible way of interpreting Example 1 is as follows. The sensors a and b are simple state machines that cycle between states p_1, p_2, and p_3. Each sensor switches between on and off depending on which state it is in. When it is in state p_1 or p_2, the sensor is on; when it is in state p_3, the sensor is off. In this interpretation, the two state machines a and b do not interact with each other in any way. Both sensors follow the same state transitions. The reason the sensors are out of sync is that they start in different states.
Our first unified interpretation is the tuple (φ′, I, R, C′), where the update rules R contain three causal rules (using ⊃-) describing how each sensor cycles from state p_1 to p_2 to p_3, and then back again to p_1. For example, the causal rule p_1(X) ⊃- p_2(X) states that if sensor X satisfies p_1 at time t, then X satisfies p_2 at time t + 1. We know that X is a sensor from the variable typing information in φ′. R also contains three static rules (using →) describing how the on or off attribute of a sensor depends on its state. For example, the static rule p_1(X) → on(X) states that if X satisfies p_1 at time t, then X also satisfies on at time t.
The constraints C′ state that (i) every sensor is (exclusively) either on or off, (ii) every sensor is (exclusively) in state p_1, p_2, or p_3, and (iii) every sensor has exactly one sensor related to it by r. The binary predicate r, or something like it, is needed to satisfy the constraint of spatial unity.
In this first interpretation, three new predicates (p_1, p_2, and p_3) are invented to represent the three states of the state machine. In the next interpretation, we will introduce new invented objects instead of invented predicates.
Given the initial conditions I and the update rules R, we can use our interpretation to compute which atoms hold at which time step. In this case, τ(θ) = (A_1, A_2, ...) where S_i ⊆ A_i, and the trace repeats. As well as being able to predict future values, we can retrodict past values (filling in A_1), or interpolate intermediate unknown values (filling in A_5 or A_9). But although an interpretation provides the resources to "fill in" missing data, it has no particular bias towards predicting future time-steps. The conditions it is trying to satisfy (the unity conditions of Section 3.3) do not explicitly insist that an interpretation must be able to predict future time-steps. Rather, the ability to predict the future (as well as the ability to retrodict the past, or interpolate intermediate values) is a derived capacity that emerges from the more fundamental capacity to "make sense" of the sensory sequence.
Example 7. There are always infinitely many different ways of interpreting a sensory sequence. Next, we show a rather different interpretation of S_{1:10} from that of Example 6. In our second unified interpretation, we no longer see sensors a and b as self-contained state machines. Now, we see the states of the sensors as depending on their left and right neighbours. In this new interpretation, we no longer need the three invented unary predicates (p_1, p_2, and p_3), but instead introduce a new object.
Object invention is much less explored than predicate invention in inductive logic programming. But Dietterich et al. [71] anticipated the need for it, and Inoue [18] uses meta-level abduction to posit unperceived objects.
In this new interpretation, imagine there is a one-dimensional cellular automaton with three cells, a, b, and (unobserved) c. The three cells wrap around: the right neighbour of a is b, the right neighbour of b is c, and the right neighbour of c is a. In this interpretation, the spatial relations are fixed. (We shall see another interpretation later where this is not the case.) The cells alternate between on and off according to the following simple rule: if X's left neighbour is on (respectively off) at t, then X is on (respectively off) at t + 1.
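This copy-your-left-neighbour dynamics can be simulated in a few lines. The sketch below is a toy rendering of Example 7's automaton; the initial state shown is hypothetical (the paper's Example 1 data is not reproduced here), but the rotation behaviour holds for any initial state:

```python
def step(state, left):
    """One tick of the three-cell automaton: each cell X takes on its
    left neighbour's previous value."""
    return {x: state[left[x]] for x in state}

# Cells wrap around: right neighbour of a is b, of b is c, of c is a,
# so the left neighbour of b is a, of c is b, and of a is c.
left = {"a": "c", "b": "a", "c": "b"}

# Hypothetical initial state.
state = {"a": "on", "b": "off", "c": "on"}
trace = [state]
for _ in range(3):
    state = step(state, left)
    trace.append(state)

assert trace[3] == trace[0]   # the pattern rotates with period 3
```

Note how the observed sensors a and b go out of sync purely because of the unobserved cell c, which is exactly the role the invented object plays in the interpretation.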
Note that objects a and b are the two sensors we are given, but c is a new unobserved latent object that we posit in order to make sense of the data. Many interpretations follow this pattern: new latent unobserved objects are posited to make sense of the changes to the sensors we are given.
Note further that part of finding an interpretation is constructing the spatial relation between objects; this is not something we are given, but something we must construct. In this case, we posit that the imagined cell c is inserted to the right of b and to the left of a.
We represent this interpretation by the tuple (φ′, I, R, C′), where φ′ extends φ, C′ extends C, and the interpretation satisfies the unity conditions.
Example 8. We shall give one more way of interpreting the same sensory sequence, to show the variety of possible interpretations.
In our third interpretation, we posit three latent cells, c_1, c_2, and c_3, distinct from the sensors a and b. Cells have static attributes: each cell can be either black or white, and this is a permanent, unchanging feature of the cell. Whether a sensor is on or off depends on whether the cell it is currently contained in is black or white. The reason the sensors change from on to off is that they move from one cell to another.
Our new type signature (T′, O′, P′, V′) distinguishes cells and sensors as separate types: T′ = {cell, sensor}, O′ = {a:sensor, b:sensor, c_1:cell, c_2:cell, c_3:cell}, P′ = {on(sensor), off(sensor), part(sensor, cell), r(cell, cell), black(cell), white(cell)}, and V′ = {X:sensor, Y:cell, Y_2:cell}. Our interpretation is the tuple (φ′, I, R, C′), where the update rules R state that the on or off attribute of a sensor depends on whether its current cell is black or white, and that the sensors move from right to left through the cells.
In this interpretation, there is no state information in the sensors. All the variability is explained by the sensors moving from one static object to another.
Here, the sensors move about, so spatial unity is satisfied by different sets of atoms at different time-steps. For example, at time-step 1, sensors a and b are indirectly connected via the ground atoms part(a, c_1), r(c_1, c_2), part(b, c_2). But at time-step 2, a and b are indirectly connected via a different set of ground atoms: part(a, c_3), r(c_3, c_1), part(b, c_1). Spatial unity requires every pair of objects to be connected via some chain of ground atoms at each time-step, but it does not insist that it is the same set of ground atoms at each time-step.
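The per-time-step connectivity check behind spatial unity amounts to graph reachability over the binary atoms of one state. The following is a simplified Python sketch (string-encoded atoms, and only the objects relevant to the step-1 chain of Example 8), not the engine's ASP encoding:

```python
def spatially_unified(objects, atoms):
    """Check spatial unity of one state: every pair of objects must be
    connected by a chain of binary ground atoms ('r(x,y)' style), where
    connectivity is the reflexive symmetric transitive closure."""
    # Extract the undirected edges contributed by binary atoms.
    edges = set()
    for atom in atoms:
        args = atom[atom.index("(") + 1:-1].split(",")
        if len(args) == 2:
            x, y = (a.strip() for a in args)
            edges.add((x, y)); edges.add((y, x))
    # Flood-fill from an arbitrary object; unity holds iff all are reached.
    objs = list(objects)
    reached, frontier = {objs[0]}, [objs[0]]
    while frontier:
        x = frontier.pop()
        for a, b in edges:
            if a == x and b not in reached:
                reached.add(b); frontier.append(b)
    return reached == set(objects)

# Time-step 1 of Example 8: a and b are connected via cells c1 and c2.
atoms1 = {"part(a, c1)", "r(c1, c2)", "part(b, c2)", "on(a)"}
assert spatially_unified({"a", "b", "c1", "c2"}, atoms1)
assert not spatially_unified({"a", "b"}, {"on(a)", "off(b)"})
```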
Examples 6, 7, and 8 provide different ways of interpreting the same sensory input. In Example 6, the sensors are interpreted as self-contained state machines. Here, there are no causal interactions between the sensors: each is an isolated machine. In Examples 7 and 8, by contrast, there are causal interactions between the sensors. In Example 7, the on and off attributes move from left to right along the sensors. In Example 8, it is the sensors that move, not the attributes, moving from right to left. The difference between these two interpretations lies in what is moving and what is static.
Note that the interpretations of Examples 6, 7, and 8 have costs 16, 12, and 17 respectively. So the theory of Example 7, which invents an unseen object, is preferred to the other theories, which posit more complex dynamics.

Properties of interpretations
In this section, we provide some results about the correctness properties of unified interpretations. The next theorem shows that all objects appearing in sensor readings satisfy object permanence.

Theorem 3. For each sensory sequence S = (S_1, ..., S_t) and each unified interpretation θ of S, for each object x that features in S (i.e. x appears in some ground atom p(x) or q(x, y) in some state S_i of S), and for each state A_i in τ(θ) = (A_1, A_2, ...), x features in A_i. In other words, if x features in any state in S, then x features in every state in τ(θ).
Proof. Let θ = (φ, I, R, C) and φ = (T, O, P, V). Since object x features in sequence S, there exists some atom α involving x in some state S_j in (S_1, ..., S_t). Since θ is an interpretation, S ⊑ τ(θ), and hence α ∈ (τ(θ))_j. Consider the two possible forms of α:

1. α = p(x). Since θ satisfies conceptual unity, there must be a constraint involving p of the form ∀X:t, p(X) ⊕ q_1(X) ⊕ ... ⊕ q_k(X). Let τ(θ) = (A_1, A_2, ...) and consider any A_i in τ(θ). Since θ satisfies static unity, A_i satisfies this constraint, so one of p(x), q_1(x), ..., q_k(x) is in A_i; hence x features in A_i.

2. α = q(x, y) for some y. Since θ satisfies conceptual unity, there must be a constraint involving q. This constraint can either be (i) a binary xor constraint relating q to other binary predicates, or (ii) a uniqueness constraint of the form ∀X:t_1, ∃!Y:t_2, q(X, Y). For case (i), let τ(θ) = (A_1, A_2, ...) and consider any A_i in τ(θ). Since θ satisfies static unity, A_i satisfies each constraint in C, and in particular the xor constraint involving q, so some binary atom involving x is in A_i; hence x features in A_i. For case (ii), again let τ(θ) = (A_1, A_2, ...) and consider any A_i in τ(θ). Since θ satisfies static unity, A_i satisfies each constraint in C, and in particular ∀X:t_1, ∃!Y:t_2, q(X, Y). Therefore there must be some y such that κ_O(y) = t_2 and q(x, y) ∈ A_i; hence x features in A_i.
Note that this proof can be extended to apply to latent objects as well. Any object appearing in any atom in some state A_i must also appear in every other state A_j.
Theorem 3 provides some guarantee that admissible interpretations that satisfy the unity conditions will always be acceptable in the minimal sense that they always provide some value for each sensor. This theorem is important because it justifies the claim that a unified interpretation will always be able to support prediction (of future values), retrodiction (of previous values), and imputation (of missing values).
Note that this theorem does not imply that the predicate of the atom in which x appears is one of the predicates appearing in the sensory sequence S. It is entirely possible that it is some distinct predicate that appears in φ but has never been observed in S. The following example illustrates this possibility.
The next theorem is a form of "completeness", showing that every sensory sequence has some admissible interpretation that satisfies the unity conditions.

Theorem 4. For every apperception task (S, φ, C) there exists some interpretation θ = (φ′, I, R, C′) that makes sense of S, where φ′ extends φ and C′ ⊇ C.
Proof. We construct a degenerate theory that simply memorizes the sequence it has seen. First, we define φ′ given φ = (T, O, P, V): for each sensor x_i that features in S, i = 1..n, and each state S_j in S, j = 1..m, create a new unary predicate p_{i,j}. Second, we define θ = (φ′, I, R, C′). Let the initial conditions I be {p_{i,1}(x_i) | i = 1..n}. Let the rules R contain the causal rules p_{i,j}(X) ⊃- p_{i,j+1}(X) for i = 1..n and j = 1..m − 1 (where x_i is of type t and X:t), together with a static rule p_{i,j}(X) → q(X) for each unary atom q(x_i) ∈ S_j, and a static rule p_{i,j}(X) ∧ p_{k,j}(Y) → r(X, Y) for each binary atom r(x_i, x_k) ∈ S_j (where x_i is of type t and x_k is of type t′). We augment C to C′ by adding the following additional constraints: letting P_t be the set of new unary predicates for all objects of type t, for each type t we add the unary constraint requiring each object of type t to satisfy exactly one predicate in P_t. It is straightforward to check that θ as defined satisfies the constraint of conceptual unity, that the constraints C′ are satisfied by each state in τ(θ), and that the sensory sequence is covered by τ(θ).
Note that the interpretation provided by Theorem 4 is degenerate and unilluminating: it treats each object entirely separately (failing to capture any regularities between objects' behaviour) and treats every time-step entirely separately (failing to capture any laws that hold over multiple time-steps). This unilluminating interpretation provides an upper bound on the complexity of the theory needed to make sense of the sensory sequence.
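The memorizing construction of Theorem 4 can be sketched concretely. The following toy Python version handles a single sensor (the names t_j, memorizing_theory, and replay are illustrative; the paper's construction uses one invented predicate per sensor per time-step):

```python
def memorizing_theory(states):
    """Build the degenerate 'memorizing' theory for one sensor: one
    invented state predicate t_j per time-step, causal rules ticking
    t_j to t_{j+1}, and static rules reading off the observed atoms."""
    m = len(states)
    init = {"t1"}                                         # start in state t_1
    causal = {f"t{j}": f"t{j + 1}" for j in range(1, m)}  # t_j >- t_{j+1}
    static = {f"t{j + 1}": set(states[j]) for j in range(m)}  # t_j -> S_j
    return init, causal, static

def replay(theory, steps):
    """Run the memorizing theory forward and recover the sequence."""
    init, causal, static = theory
    state = next(iter(init))
    out = []
    for _ in range(steps):
        out.append(static[state])
        state = causal.get(state, state)   # frame axiom: stay put at the end
    return out

S = [{"on(a)"}, {"off(a)"}, {"on(a)"}]
assert replay(memorizing_theory(S), 3) == S   # the theory memorizes S exactly
```

As the theorem notes, such a theory captures no regularities: its size grows linearly with the length of the sequence, which is what the cost minimization is there to avoid.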

Computer implementation
The Apperception Engine is our system for solving apperception tasks. Given as input an apperception task (S, φ, C), the engine searches for a type signature φ′ and a theory θ = (φ′, I, R, C′) where φ′ extends φ, C′ ⊇ C, and θ makes sense of S. In this section, we describe how it is implemented.

Definition 20. A template is a structure for circumscribing a large but finite set of theories: a type signature together with constants that bound the complexity of the rules in the theory. Formally, a template χ is a tuple (φ, N_→, N_⊃-, N_B), where φ is a type signature, N_→ is the maximum number of static rules allowed in R, N_⊃- is the maximum number of causal rules allowed in R, and N_B is the maximum number of atoms allowed in the body of a rule in R. Each template χ specifies a large (but finite) set of theories that conform to χ. Let Θ_{χ,C} be the subset of theories (φ, I, R, C′) that conform to χ and where C′ ⊇ C.

Algorithm 1:
The Apperception Engine algorithm in outline.
input: (S, φ, C), an apperception task
output: θ*, a unified interpretation of S
(s*, θ*) ← (max(float), nil)

Note that the relationship between the complexity of a template and the cost of a theory satisfying the template is not always simple. Sometimes a theory of lower cost may be found from a template of higher complexity. This is why we cannot terminate as soon as we have found the first theory θ. We must keep going, in case we later find a lower-cost theory from a more complex template.
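The outer loop of Algorithm 1 can be sketched as follows. This is a toy Python rendering: `best_theory_for` stands in for the per-template ASP solving step, and the real engine iterates an unbounded template stream rather than the finite list used here:

```python
def apperception(task, templates, best_theory_for):
    """Enumerate templates in order of increasing complexity, solve each,
    and keep the cheapest theory found so far.  `best_theory_for` returns
    (cost, theory) for a template, or None if it admits no solution."""
    best = (float("inf"), None)          # (s*, theta*) <- (max, nil)
    for chi in templates:
        result = best_theory_for(task, chi)
        # A later, more complex template may still yield a cheaper theory,
        # so we do not stop at the first solution.
        if result is not None and result[0] < best[0]:
            best = result
    return best

# Toy run: three 'templates' whose best solutions cost None, 16, and 12.
fake = {1: None, 2: (16, "theta_a"), 3: (12, "theta_b")}
result = apperception(None, [1, 2, 3], lambda t, chi: fake[chi])
assert result == (12, "theta_b")
```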
The two non-trivial parts of this algorithm are the way we enumerate templates, and the way we find the lowest-cost theory θ for a given template χ. We consider each in turn.

Iterating through templates
We need to enumerate templates in such a way that every template is (eventually) visited by the enumeration. Since the objects, predicates, and variables are typed (see Definition 3), the acceptable ranges of O, P, and V depend on T. Because of this, our enumeration procedure is two-tiered: first, enumerate sets T of types; second, given a particular T, enumerate (O, P, V, N_→, N_⊃-, N_B) tuples for that particular T. We cannot, of course, enumerate all such tuples at once, because there are infinitely many. Instead, we specify a constant bound n on the number of tuples, and gradually increase that bound. In order to enumerate (T, n) pairs, we use a standard diagonalization procedure. See Table 1.
The choices function takes a finite number n of (possibly infinite) lists, and produces a (possibly infinite) list of n-tuples, generating an n-way Cartesian product that is guaranteed to eventually produce every such n-tuple.
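The fair product behind choices can be sketched in Python (the engine's implementation is in Haskell; this diagonalizing generator is an illustrative equivalent, not the paper's code):

```python
from itertools import count, islice, product

def choices(streams):
    """Fair enumeration of the n-way Cartesian product of possibly
    infinite streams: tuples are produced in rounds of increasing depth d,
    each round emitting the tuples whose largest component index is d-1,
    so every tuple is eventually reached."""
    cached = [[] for _ in streams]
    iters = [iter(s) for s in streams]
    for depth in count(1):
        for c, it in zip(cached, iters):
            while len(c) < depth:            # extend each cached prefix
                try:
                    c.append(next(it))
                except StopIteration:
                    break
        emitted = False
        ranges = [range(min(depth, len(c))) for c in cached]
        for ixs in product(*ranges):
            if max(ixs) == depth - 1:        # only genuinely new tuples
                emitted = True
                yield tuple(c[i] for c, i in zip(cached, ixs))
        if not emitted:
            return                           # all streams exhausted

pairs = list(islice(choices([count(0), count(0)]), 9))
assert pairs[0] == (0, 0)
assert (2, 2) in pairs     # every tuple eventually appears
```

The naive row-major product of two infinite streams would get stuck forever on the first stream's first element; the round-by-round diagonalization is what guarantees that every template is eventually visited.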
We use choices to generate (O, P, V, N_→, N_⊃-, N_B) tuples by creating six infinite streams: (i) S_O: an infinite list of finite lists of typed objects, (ii) S_P: an infinite list of finite lists of typed predicates, (iii) S_V: an infinite list of finite lists of typed variables, (iv) S_→ = {0, 1, ...}: the number of static rules, (v) S_⊃- = {0, 1, ...}: the number of causal rules, and (vi) S_B = {0, 1, ...}: the max number of body atoms. When we pass this list of streams to the choices function, it produces an enumeration of the 6-way Cartesian product.

Example 11. Recall the apperception problem from Example 1. There are two sensors a and b, and each sensor can be on or off. We use the template enumeration procedure described above to generate increasingly complex templates χ_1, χ_2, ..., using χ_0 as a base.
Later, the Apperception Engine finds another solution using the type signature φ′ = (T′, O′, P′, V′) (again, the augmented parts of the type signature are in bold). Here, it has constructed an invented object o_1:sensor and posited a one-dimensional spatial relation r_1 between the three sensors. This solution is recognisable as a variant of Example 7 above.

Finding the best theory from a template
The most complex part of Algorithm 1 is finding, for a given template χ, the theory θ with the lowest cost (see Definition 17) such that θ conforms to χ and includes the constraints in C, τ(θ) covers S, and θ satisfies the conditions of unity. In this sub-section, we explain in outline how this works.
Our approach combines abduction and induction to generate a unified interpretation θ (see Fig. 1). Here, X ⊆ G is a set of facts (ground atoms), P : G → G is a procedure for generating the consequences of a set of facts, and Y ⊆ G is the result of applying P to X. If X and P are given, and we wish to generate Y, then we are performing deduction. If P and Y are given, and we wish to generate X, then we are performing abduction. If X and Y are given, and we wish to generate P, then we are performing induction. Finally, if only Y is given, and we wish to generate both X and P, then we are jointly performing abduction and induction. This is what the Apperception Engine does. Our method is described in Algorithm 2. In order to jointly abduce a set I (of initial conditions) and induce sets R and C (of rules and constraints), we implement a Datalog ⊃- interpreter in ASP; see Section 4.3 for the details. This interpreter takes a set I of atoms (represented as a set of ground ASP terms) and sets R and C of rules and constraints (represented again as sets of ground ASP terms), and computes the trace of the theory τ(θ) = (S_1, S_2, ...) up to a finite time limit.
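The four inference modes can be made concrete with a toy example. The facts, rules, and brute-force searches below are illustrative stand-ins (the engine searches the space with an ASP solver, not by enumeration):

```python
def apply(P, X):
    """Deduction: given facts X and a one-shot rule set P, derive Y."""
    return {head for body, head in P if body <= X}

def abduce(P, Y, candidate_facts):
    """Abduction: given P and observed Y, search for X with apply(P, X) = Y."""
    return next(X for X in candidate_facts if apply(P, X) == Y)

def induce(X, Y, candidate_programs):
    """Induction: given X and Y, search for P with apply(P, X) = Y."""
    return next(P for P in candidate_programs if apply(P, X) == Y)

rules = [(frozenset({"p(a)"}), "on(a)"), (frozenset({"p(b)"}), "on(b)")]
Y = {"on(a)", "on(b)"}

X = abduce(rules, Y, [{"p(a)"}, {"p(b)"}, {"p(a)", "p(b)"}])
assert X == {"p(a)", "p(b)"}

P = induce(X, Y, [[], rules])
assert P == rules
# Joint abduction + induction, as in the engine, searches over (X, P)
# pairs simultaneously, with the cost function ranking the candidates.
```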
Concretely, we implement the interpreter as an ASP program π_τ that computes τ(θ) for theory θ. We implement the conditions of unity as ASP constraints in a program π_u. We implement the cost minimization as an ASP program π_m that counts the number of atoms in each rule plus the number of initialisation atoms in I, and uses an ASP weak constraint [60] to minimize this total. Then we generate ASP programs representing the sequence S, the initial conditions, and the rules and constraints. We combine the ASP programs together and ask the ASP solver (clingo [73]) to find a lowest-cost solution. (There may be multiple solutions of equally lowest cost; the ASP solver chooses one of the optimal answer sets.) We extract a readable interpretation θ from the ground atoms of the answer set. In Section 4.3, we explain how Algorithm 2 is implemented in ASP. In Section 4.4, we evaluate the computational complexity. In Section 4.5, we describe the various optimisations used to prune the search. In Section 5.2.4, we compare with ILASP, a state-of-the-art ILP system.

Algorithm 2: Finding the lowest cost θ for sequence S and template χ. Here, π_τ computes the trace, π_u checks that the unity conditions are satisfied, and π_m minimizes the cost of θ.
input: S, a sensory sequence
input: χ = (φ, N_→, N_⊃-, N_B), a template
input: C, a set of constraints on the predicates of the sensory sequence
output: θ, the simplest unified interpretation of S that conforms to χ
π_S ← gen_input(S)

The ASP encoding
Our Datalog ⊃- interpreter is written in ASP. All elements of Datalog ⊃-, including variables, are represented by ASP constants: a variable X is represented by a constant var_x, and a predicate p is represented by a constant c_p. Elements of the target language are reified in ASP, so an unground atom p(X) of Datalog ⊃- is represented by a term s(c_p, var_x), and a rule is represented by a set of ground atoms for the body, and a single ground atom for the head. For example, the static rule p(X) ∧ q(X, Y) → r(Y) is represented as:

rule_body(r1, s(c_p, var_x)).
rule_body(r1, s2(c_q, var_x, var_y)).
rule_head_static(r1, s(c_r, var_y)).
Given a type signature φ, we construct ASP terms that represent every well-typed unground atom in U_φ, and wrap these terms in the is_var_atom predicate. For example, to represent that r(C, C_2) is a well-typed unground atom, we write is_var_atom(atom(c_r, var_c, var_c2)). Similarly, we construct ASP terms that represent every well-typed ground atom in G_φ using the is_ground_atom predicate.
Since τ(θ) is an infinite sequence (A_1, A_2, ...), we cannot compute the whole of it. Instead, we only compute the sequence up to the max time of the original sensory sequence S.
Note that the frame axiom uses negation as failure and strong negation [64] to check that some other incompatible atom has not already been added. Thus, the frame axiom is not restrictive and can be overridden as needed by the causal rules, to handle predicates that are not inertial.
The conditions of unity described in Section 3.3 are represented directly as ASP constraints in π_u. For example, spatial unity is encoded as:

:- spatial_unity_counterexample(X, Y, T).
Here, there is a counterexample to spatial unity at time T if objects X and Y are not related:

spatial_unity_counterexample(X, Y, T) :-
    is_object(X), is_object(Y), is_time(T),
    not related(X, Y, T).
Here, related is the reflexive symmetric transitive closure of the relation holding between X and Y if there is some relation R connecting them. Note that related can quantify over relations because the Datalog ⊃- atoms and predicates have been reified into terms:

related(X, Y, T) :- holds(s2(R, X, Y), T).
related(X, X, T) :- is_object(X), is_time(T).
When constructing a theory θ = (φ, I, R, C), the solver needs to choose which ground atoms to use as initial conditions in I, which static and causal rules to include in R, and which xor or uniqueness conditions to use as constraints in C.
To allow the solver to choose what to include in I, we add the following ASP choice rule to π_I:

{ init(A) } :- is_ground_atom(A).
To allow the solver to choose which rules to include in R, we add further clauses to π_R. Here, k_max_body is the N_B parameter of the template that specifies the max number of body atoms in any rule. The number of rules satisfying is_static_rule and is_causes_rule is determined by the template parameters N_→ and N_⊃- respectively (see Definition 20).
The ASP program π_m minimizes the cost of the theory θ (see Definition 17) using weak constraints [60].

Table 3
The number of ground clauses in the ASP encoding of Algorithm 2.

Complexity
This section describes the complexity of Algorithm 2. We assume basic concepts and standard terminology from complexity theory. Let P be the class of problems that can be solved in polynomial time by a deterministic Turing machine, NP be the class of problems solved in polynomial time by a non-deterministic Turing machine, and EXPTIME be the class of problems solved in time 2^(n^d) by a deterministic Turing machine. Let Σ^p_{i+1} = NP^{Σ^p_i} be the class of problems that can be solved in polynomial time by a non-deterministic Turing machine with a Σ^p_i oracle. Finding a solution to an ASP program is in NP [74,75], while finding an optimal solution to an ASP program with weak constraints is in Σ^p_2 [76,77]. Since deciding whether a non-disjunctive ASP program has a solution is in NP [74,75], our ASP encoding of Algorithm 2 shows that finding a unified interpretation θ for a sequence given a template is in NP. Since verifying whether a solution to an ASP program with preferences is indeed optimal is in Σ^p_2 [76,77], our ASP encoding shows that finding the lowest cost theory is in Σ^p_2. However, the standard complexity results assume the ASP program has already been grounded into a set of propositional clauses. To really understand the space and time complexity of Algorithm 2, we need to examine how the set of ground atoms in the ASP encoding grows as a function of the parameters in the template χ. Observe that, since we restrict ourselves to unary and binary predicates, the number of ground and unground atoms is a small polynomial function of the type signature parameters. But note that the number of substitutions compatible with the signature φ is an exponential function of the number of variables |V|. Thus, in our ASP encoding, finding the lowest cost theory is in Σ^p_2, but the number of ground propositional clauses grows exponentially with |V|, the number of variables allowed in a rule. Table 2 shows the number of ground clauses for the three most expensive clauses.
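The growth behaviour can be made concrete with back-of-envelope counts. The formulas below are an illustrative simplification (one shared type, unary and binary predicates only), not the engine's exact grounding count:

```python
def grounding_sizes(n_objects, n_unary, n_binary, n_vars):
    """Rough grounding sizes under the simplifying assumption of a
    single shared type: ground/unground atom counts are polynomial in
    the signature parameters, but the substitution count is not."""
    ground_atoms = n_unary * n_objects + n_binary * n_objects ** 2
    unground_atoms = n_unary * n_vars + n_binary * n_vars ** 2
    substitutions = n_objects ** n_vars   # exponential in the variable count
    return ground_atoms, unground_atoms, substitutions

g, u, s = grounding_sizes(n_objects=3, n_unary=2, n_binary=1, n_vars=4)
assert (g, u, s) == (15, 24, 81)   # atoms stay small; substitutions blow up
```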

Optimization
Because of the combinatorial complexity of the apperception task, we had to introduce a number of optimizations to get reasonable performance on even the simplest of domains.

Reducing grounding with type checking
We use the type signature φ to dramatically restrict the set G_φ of ground atoms, the set U_φ of unground atoms, the set of substitutions, and the set R_φ of rules. Type-checking has been shown to drastically reduce the search space in program synthesis tasks [78].

Symmetry breaking
We use symmetry breaking to remove candidates that are equivalent. We remove programs that are equivalent up to a variable renaming by using a strict ordering on variables. We also remove programs that are equivalent up to a reordering of the rules by using a strict ordering on unground atoms.

Adding redundant constraints
ASP programs can be significantly optimized by adding redundant constraints (constraints that are provably entailed by the other clauses in the program) [79]. We sped up solving time by about 30% by adding the following redundant constraints:

:- init(A), init(B), incompossible(A, B).

Five experimental domains
To evaluate the generality of our system, we tested it in a variety of domains: elementary (one-dimensional) cellular automata, drum rhythms and nursery tunes, sequence induction tasks, multi-modal binding tasks, and occlusion tasks. These particular domains were chosen because they represent a diverse range of tasks that are simple for humans but hard for state-of-the-art machine learning systems. The tasks were chosen to highlight the difference between mere perception (the classification tasks that machine learning systems already excel at) and apperception (assimilating information into a coherent integrated theory, something traditional machine learning systems are not designed to do).

Results
We implemented the Apperception Engine in Haskell and ASP. We used clingo [73] to solve the ASP programs generated by our system. We ran all experiments with a time-limit of 4 hours on a standard Unix desktop.
Our experiments (on the prediction task) are summarised in Table 4. Note that our accuracy metric for a single task is rather exacting: the model is accurate (a Boolean judgment) on a task iff every hidden sensor value is predicted correctly.

Table 4
Results for prediction tasks on five domains. We show the mean information size of the sensory input, to stress the scantiness of our sensory sequences. We also show the mean information size of the held-out data. Our metric of accuracy for prediction tasks is whether the system predicted every sensor's value correctly.

The model does not score any points for predicting most, but not all, of the hidden values correctly. As can be seen from Table 4, our system is able to achieve good accuracy across all five domains.

Elementary cellular automata
An Elementary Cellular Automaton (ECA) [80,81] is a one-dimensional cellular automaton. The world is a circular array of cells. Each cell can be either on or off. The state of a cell depends only on its previous state and the previous states of its left and right neighbours.
Fig. 2 shows one set of ECA update rules. Each update specifies the new value of a cell based on its left neighbour's previous value, its own previous value, and its right neighbour's previous value. The top row shows the values of the left neighbour, the cell, and the right neighbour; the bottom row shows the new value of the cell. There are 8 updates, one for each of the 2³ = 8 configurations. In the diagram, the leftmost update states that if the left neighbour is on, the cell is on, and its right neighbour is on, then at the next time-step the cell will be turned off. Given that each of the 8 configurations can produce on or off at the next time-step, there are 2^(2³) = 256 sets of update rules in total.
Given update rules for each of the 8 configurations, and an initial starting state, the trajectory of the ECA is determined. Fig. 3 shows the state sequence for Rule 110 from one initial starting state of length 11.
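An ECA trajectory is easy to generate directly, which is how the oracle data for this domain can be produced. A minimal sketch (the initial state here is hypothetical, not the one in Fig. 3):

```python
def eca_step(cells, rule=110):
    """One step of an elementary cellular automaton on a circular array.
    Each new cell value is looked up from the 3-bit neighbourhood
    (left, self, right) in the rule number's binary expansion."""
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

# Rule 110 in binary is 01101110: neighbourhood 111 -> 0, 110 -> 1, etc.
state = [0] * 10 + [1]          # a single 'on' cell in an 11-cell world
trace = [state]
for _ in range(9):
    state = eca_step(state)
    trace.append(state)

assert all(len(s) == 11 for s in trace)
assert eca_step([1, 1, 1]) == [0, 0, 0]   # neighbourhood 111 -> off
```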
In our experiments, we attach sensors to each of the 11 cells, produce a sensory sequence, and then ask our system to find an interpretation that makes sense of the sequence. For example, the state sequence of Fig. 3 yields the sensory sequence (S_1, ..., S_10). This particular set of update rules is known as Rule 110: 110 is the decimal representation of the binary update rule 01101110, as shown in Fig. 2. This rule has been shown to be Turing-complete [81].
Results. Given the 256 ECA rules, all with the same initial configuration, we treated the trajectories as a prediction task and applied our system to it. Our system was able to predict 249/256 correctly. In each of the 7/256 failure cases, the Apperception Engine found a unified interpretation, but this interpretation produced a prediction which was not the same as the oracle. Consider, for example, the dynamics found for Fig. 3. These rules exactly capture the dynamics of the cells that change. The other cells retain their values from the previous time-step, according to the frame axiom of Definition 9.
The initial conditions found by the Apperception Engine describe the initial values of the cells, and also specify the latent r relation between cells. Here, the system uses r(X, Y) to mean that cell Y is immediately to the right of cell X. Note that the system has constructed the spatial relation itself. It was not given the spatial relation r between cells. All it was given was the sensor readings of the 11 cells. It constructed the spatial relationship r between the cells in order to make sense of the data.

Drum rhythms and nursery tunes
We also tested our system on simple melodies and rhythms. Here, each sensor is an auditory receptor that is tuned to listen for a particular note or drum beat. In the tune tasks, there is one sensor for C, one for D, one for E, all the way to HighC. (There are no flats or sharps.) In the rhythm tasks, there is one sensor listening for bass drum, one for snare drum, and one for hi-hat. Each sensor can distinguish four loudness levels, between 0 and 3. When a note is pressed, it starts at max loudness (3), and then decays down to 0. Multiple notes can be pressed simultaneously.
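The decay behaviour of a single sensor can be sketched as follows (an illustrative model; the function name and the representation of presses as a list of press times are our own):

```python
def loudness(press_times, t):
    # Loudness of one sensor at time t: a press starts at 3 and decays
    # 3 -> 2 -> 1 -> 0 over successive time-steps; if the same note has
    # been pressed more than once, the loudest active press wins.
    levels = [max(0, 3 - (t - p)) for p in press_times if p <= t]
    return max(levels, default=0)
```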
For example, the Twinkle Twinkle Little Star tune generates a characteristic grid of sensor readings (assuming 8 time-steps for a bar).

Results. Recall that our accuracy metric is stringent and only counts a prediction as accurate if every sensor's value is predicted correctly. In the rhythm and music domain, this means the Apperception Engine must correctly predict the loudness value (between 0 and 3) for each of the sound sensors. There are 8 sensors for tunes and 3 sensors for rhythms. When we tested the Apperception Engine on the 20 drum rhythms and 10 nursery tunes, our system was able to predict 22/30 correctly. Note that the interpretations found are large and complex programs by the standards of state-of-the-art ILP systems. In Three Blind Mice, for example, the interpretation contained 10 update rules and 34 initialisation atoms, making a total of 44 clauses. The reason we are able to find such large programs is that our ASP encoding represents the space of possible programs much more efficiently than other encodings (see Section 5.2.4).

Seek Whence and C-test sequence induction intelligence tests
Hofstadter introduced the Seek Whence domain in [24]. The task is, given a sequence s_1, ..., s_t of symbols, to predict the next symbol s_{t+1}. Typical Seek Whence tasks include:

a,a,b,a,b,c,a,b,c,d,a
b,a,b,b,b,b,b,c,b,b,d,b,b,e, ...
a,b,b,c,c,c,d,d,d,d,e, ...
a,f,e,f,a,f,e,f,a,f,e,f,a
a,b,b,c,c,d,d,e,e, ...
a,b,c,c,d,d,e,e,e,f,f,f, ...
f,a,f,b,f,c,f,d,f, ...
a,f,e,e,f,a,a,f,e,e,f,a,a, ...
b,b,b,c,c,b,b,b,c,c,b,b,b,c,c, ...
b,a,a,b,b,b,a,a,a,a,b,b,b,b,b, ...
b,c,a,c,a,c,b,d,b,d,b,c,a,c,a, ...
a,b,b,c,c,d,d,e,e,f,f, ...
a,a,b,a,b,c,a,b,c,d,a,b,c,d,e, ...
b,a,c,a,b,d,a,b,c,e,a,b,c,d,f, ...
a,b,a,c,b,a,d,c,b,a,e,d,c,b, ...
c,b,a,b,c,b,a,b,c,b,a,b,c,b, ...
a,a,a,b,b,c,e,f,f, ...
a,a,b,a,a,b,c,b,a,a,b,c,d,c,b, ...
a,a,b,c,a,b,b,c,a,b,c,c,a,a,a, ...
a,b,a,b,a,b,a,b,a, ...
a,c,b,d,c,e,d, ...
a,c,f,b,e,a,d, ...
a,a,f,f,e,e,d,d
a,a,b,b,f,a,b,b,e,a,b,b,d, ...
f,a,d,a,b,a,f,a,d,a,b,a, ...
a,b,a,f,a,a,e,f,a

Results. Given the 30 Seek Whence sequences, we treated the trajectories as a prediction task and applied our system to it. Our system was able to predict 23/30 correctly. For the 7 failure cases, 4 of them were due to the system not being able to find any unified interpretation within the memory and time limits, while in 3 of them, the system found a unified interpretation that produced the "incorrect" prediction (Fig. 4).
The first key point we want to emphasise here is that our system was able to achieve human-level performance on these tasks without hand-coded domain-specific knowledge. This is a general system for making sense of sensory data that, when applied to the Seek Whence domain, is able to solve these particular problems. The second point we want to stress is that our system did not learn to solve these sequence induction tasks after seeing many previous examples. On the contrary: our system had never seen any such sequences before; it confronts each sequence de novo, without prior experience. This system is, to the best of our knowledge, the first such general system that is able to achieve such a result.

Binding tasks
The binding problem [87] is the task of recognising that information from different sensory modalities should be collected together as different aspects of a single external object. For example, you hear a buzzing in your auditory field and you see an insect in your visual field. How do you associate the buzzing and the insect-appearance as aspects of one single object?
To investigate how our system handles such binding problems, we tested it on the following multi-modal variant of the ECA described above. Here, there are two types of sensor. The light sensors have just two states, black and white, while the touch sensors have four states: fully un-pressed (0), fully pressed (3), and two intermediate states (1, 2). After a touch sensor is fully pressed (3), it is slowly released, going from state 2 to 1 to 0 over three time-steps. In this example, we chose Rule 110 (the Turing-complete ECA rule) with the same initial configuration as in Fig. 3, as described earlier. In this multi-modal variant, there are 11 light sensors, one for each cell in the ECA, and two touch sensors on cells 3 and 11. See Fig. 5.
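The essence of the binding problem is that sensors from different modalities must be grouped as attributes of one latent object. A minimal sketch of such a binding map (illustrative names only, not our system's internal representation):

```python
# Each sensor is bound to the latent object (cell) it is an aspect of.
binding = {
    'light_3': 'cell_3', 'touch_A': 'cell_3',    # two modalities, one object
    'light_11': 'cell_11', 'touch_B': 'cell_11',
}

def cosensors(sensor, binding):
    # The other sensors bound to the same latent object as `sensor`.
    obj = binding[sensor]
    return sorted(s for s, o in binding.items() if o == obj and s != sensor)
```

Solving the binding problem amounts to constructing such a map from the raw sensor readings alone.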
Results. We ran 20 multi-modal binding experiments, with different ECA rules, different initial conditions, and the touch sensors attached to different cells. The engine achieved 85% accuracy.

Occlusion tasks
Neural nets that predict future sensory data conditioned on past sensory data struggle to solve occlusion tasks because it is hard to inject into them the prior knowledge that objects persist over time. Our system, by contrast, was designed to posit latent objects that persist over time.
To test our system's ability to solve occlusion problems, we generated a set of tasks of the following form: there is a 2D grid of cells in which objects move horizontally. Some move from left to right, while others move from right to left, with wrap-around when they get to the edge of a row. The objects move at different speeds. Each object is placed in its own row, so there is no possibility of collision. There is an "eye" placed at the bottom of each column, looking up. Each eye can only see the objects in the column it is placed in. An object is occluded if there is another object below it in the same column. See Fig. 6.
The system receives a sensory sequence consisting of the positions of the moving objects whenever they are visible. The positions of the objects when they are occluded are used as held-out test data to verify the predictions of the model. This is an imputation task.
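The occlusion setup can be simulated in a few lines (an illustrative sketch; we assume rows are numbered top-down, so the eye at the bottom of a column sees the object with the largest row index):

```python
def visible(objects, width, t):
    # objects: list of (row, start_col, velocity); motion wraps at the edge.
    # Each column's eye sees only the lowest object in that column; any
    # object above it is occluded and its position must be imputed.
    positions = [(row, (start + vel * t) % width) for row, start, vel in objects]
    seen = {}
    for row, col in positions:
        if col not in seen or row > seen[col]:
            seen[col] = row
    return seen  # column -> row of the visible object
```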
Results. We generated 20 occlusion tasks by varying the size of the grid, the number of moving objects, and their direction and speed. The Apperception Engine achieved 90% accuracy.

Empirical evaluations
In this section, we test our system to evaluate the truth of the following hypotheses:

1. The five domains of Section 5.1 are challenging tasks for existing baselines.
2. Our system handles retrodiction and imputation just as easily as prediction.
3. The features of our system (unity conditions, cost minimisation) are essential to its performance.
4. Our system outperforms state-of-the-art inductive logic programming approaches in the five domains.
We consider each in turn.

The five domains of section 5.1 are challenging tasks for existing baselines
To evaluate whether our domains are indeed sufficiently challenging, we compared our system against four baselines. The first, constant, baseline always predicts the same constant value for every sensor at each time-step. The second, inertia, baseline always predicts that the final hidden time-step equals the penultimate time-step. The third, MLP, baseline is a fully-connected multilayer perceptron (MLP) [88] that looks at a window of up to 10 earlier time-steps to predict the next time-step. The fourth, LSTM, baseline is a recurrent neural net based on the long short-term memory (LSTM) architecture [89].
The neural baselines are designed to exploit potential statistical patterns that are indicative of hidden sensor states. In the MLP baseline, we formulate the problem as a multi-class classification problem, where the input consists of a feature representation x of relevant context sensors, and a feed-forward network is trained to predict the correct state y of a given sensor in question. In the prediction task, the feature representation comprises one-hot representations of the state of every sensor in the two time-steps before the hidden sensor. The training data consists of the collection of all observed states in an episode (as potential hidden sensors), together with the respective history before each. Samples with an incomplete history window (at the beginning of the episode) are discarded.
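The feature construction just described can be sketched as follows (an illustrative reimplementation; the function names are our own):

```python
def one_hot(state, n_states):
    # Length-n vector that is zero everywhere except at index `state`.
    v = [0] * n_states
    v[state] = 1
    return v

def make_examples(episode, window=2):
    # episode: list of time-steps, each a list of per-sensor states.
    # Features: one-hot states of every sensor over the previous `window`
    # steps; one training example per sensor per step. Steps with an
    # incomplete history window are discarded.
    n_states = max(max(step) for step in episode) + 1
    X, y = [], []
    for t in range(window, len(episode)):
        feats = [bit for step in episode[t - window:t]
                 for s in step for bit in one_hot(s, n_states)]
        for sensor_state in episode[t]:
            X.append(feats)
            y.append(sensor_state)
    return X, y
```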
The MLP classifier is a 2-layer feed-forward neural network, which is trained on all training examples derived from the current episode (thus no cross-episode transfer is possible). We restrict the number of hidden neurons to (20, 20) for the two layers, respectively, in order to prevent overfitting given the limited number of training points within an episode. We use a learning rate of 10^{-3} and train the model using the Adam optimiser [90] for up to 200 epochs, holding aside 10% of data for early stopping.
Given that the input is a temporal sequence, a recurrent neural network (which was designed to model temporal dynamics) is a natural choice of baseline. But we found that the LSTM performs only slightly better than the MLP on Seek Whence tasks, and worse on the other tasks. The reason for this is that the paucity of data (a single temporal sequence consisting of a small number of time-steps) does not provide enough information for the high-capacity LSTM to learn desirable gating behaviour. The simpler and more constrained MLP with fewer weights is able to do slightly better on some of the tasks, yet both neural baselines achieve low accuracy in absolute terms. Fig. 7 shows the results. Clearly, the tasks are very challenging for all four baseline systems.

Our system handles retrodiction and imputation just as easily as prediction
To evaluate whether our system is just as capable of retrodicting earlier values and imputing missing intermediate values as it is at predicting future values, we ran tests where the unseen hidden sensor values were at the first time step (in the case of retrodiction) or randomly scattered through the time-series (in the case of imputation). We made sure that the number of hidden sensor values was the same for prediction, retrodiction, and imputation. Fig. 8 shows the results. The results are significantly lower for retrodiction in the ECA tasks, but otherwise comparable. The reason for retrodiction's lower performance on ECA is that, for a particular initial configuration, a significant number (more than 50%) of the ECA rules wipe out all the information in the current state after the first state transition, and all subsequent states then remain the same. The results for imputation are comparable with the results for prediction. Although the results for rhythm and music are lower, the results on Seek Whence are slightly higher. (A one-hot representation of feature i of n possible features is a vector of length n in which all the elements are 0 except the i'th element.)
(A deterministic transition dynamic is a function from a set of ground atoms to a set of ground atoms. If this function is not injective, then information is lost as we go through time: there will be a unique next step from the current time-step, but there can be multiple previous steps that transition into the current time-step.) Our search procedure uses maximum a posteriori (MAP) estimation: we find a single model with the highest posterior (based on the likelihood, i.e. how well it explains the sequence, and on the prior, i.e. the program length), and use that model to predict, retrodict, and impute. A more ambitious Bayesian approach would construct a probability distribution over rival theories, and use a mixture model for retrodiction. But, given the computational complexity of finding a single solution (see Section 4.4), this ambitious approach is, in the short term at least, prohibitively expensive.

Fig. 8. Comparing prediction with retrodiction and imputation. In retrodiction, we display accuracy on the held-out initial time-step. In imputation, a random subset of atoms is held out; the held-out atoms are scattered throughout the time-series. In other words, there may be different held-out atoms at different times. The number of held-out atoms in imputation matches the number of held-out atoms in prediction and retrodiction.
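The information loss that makes retrodiction harder than prediction can be seen in a toy non-injective transition function (purely illustrative):

```python
# A non-injective transition function: the next state is unique, but
# several past states map to the same present state.
step = {'A': 'C', 'B': 'C', 'C': 'C'}

successor = step['A']                                          # prediction: unique
preimage = sorted(s for s, nxt in step.items() if nxt == 'C')  # retrodiction: ambiguous
```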

The features of our system are essential to its performance
To verify that the unity conditions are doing useful work, we performed a number of experiments in which the various conditions were removed, and compared the results. We ran four ablation experiments. In the first, we removed the check that the theory's trace covers the input sequence: S ⊑ τ(θ) (see Definition 16). In the second, we removed the check on conceptual unity. Removing this condition means that the unary predicates are no longer connected together via exclusion relations ⊕, and the binary predicates are no longer constrained by ∃! conditions. (See Definition 13.) In the third ablation test, we removed the check on spatial unity. Removing this condition means allowing objects which are not connected via binary relations. In the fourth ablation test, we removed the cost-minimization part of the system. Removing this minimization means that the system will return the first interpretation it finds, irrespective of size.
The results of the ablation experiments are displayed in Table 5. The first ablation test, where we remove the check that the generated sequence of sets of ground atoms respects the original sensory sequence (S ⊑ τ(θ)), performs very poorly. Of course, if the generated sequence does not cover the given part of the sensory sequence, it is highly unlikely to accurately predict the held-out part of the sensory sequence. This test is just a sanity check that our evaluation scripts are working as intended.
The second ablation test, where we remove the check on conceptual unity, also performs poorly. The reason is that without constraints, there are no incompossible atoms. Recall from Definition 9 that two atoms are incompossible if there is some ⊕ constraint or some ∃! constraint that means the two atoms cannot be simultaneously true. But in Definition 9, the frame axiom forces an atom that was true at the previous time-step to also be true at the next time-step unless the old atom is incompossible with some new atom: we add α to H_t if α is in H_{t-1} and there is no atom in H_t that is incompossible with α. But if there are no incompossible atoms, then all previous atoms are always added. Therefore, if there are no ⊕ and ∃! constraints, then the set of true atoms monotonically increases over time. This in turn means that state information becomes meaningless: once something becomes true, it remains true forever, and cannot be used to convey information.
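The interaction between the frame axiom and incompossibility can be sketched as follows (illustrative code; the atom syntax and the `incomp` relation are our own stand-ins for the ⊕ constraints):

```python
def frame_step(prev, new, incompossible):
    # Frame axiom: carry an old atom forward unless some new atom is
    # incompossible with it.
    carried = {a for a in prev if not any(incompossible(a, b) for b in new)}
    return set(new) | carried

def incomp(a, b):
    # Stand-in exclusion constraint: on(X) and off(X) cannot both hold.
    pa, xa = a.split('(')
    pb, xb = b.split('(')
    return xa == xb and {pa, pb} == {'on', 'off'}
```

With `incomp`, a new `off(c1)` displaces the old `on(c1)`; with no constraints at all, the set of true atoms only grows, which is exactly the pathology exhibited by the second ablation.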
When we remove the spatial unity constraint, the results for the rhythm tasks are identical, but the results for the ECA and Seek Whence tasks are lower. The reason why the results are identical for the rhythm tasks is that the background knowledge provided (the r relation on notes, see Section 5.1.3) means that the spatial unity constraint is guaranteed to be satisfied. The reason why the results are lower for ECA tasks is that interpretations that fail to satisfy spatial unity contain disconnected clusters of cells (e.g. cells {c_1, ..., c_5} are connected by r in one cluster, while cells {c_6, ..., c_11} are connected in another cluster, but the two clusters are disconnected from each other). Interpretations with disconnected clusters tend to generalize poorly and hence predict with less accuracy. The reason why the results are only slightly lower for the Seek Whence tasks is that the lowest-cost unified interpretation for most of these tasks also happens to satisfy spatial unity.
The results for the fourth ablation test, where we remove the cost minimization, are broadly comparable with the full system in ECA and Seek Whence, but are markedly worse in the rhythm / music tasks. But even if the results were comparable in all tasks, there are independent reasons to want to minimize the size of the interpretation. Shorter interpretations are more human-readable, and transfer better to new situations (since they tend to be more general, as they have fewer atoms in the bodies of the rules).

Our system outperforms state-of-the-art inductive logic programming approaches in the five domains
In order to assess the efficiency of our system, we compared it to ILASP [96-99], a state-of-the-art Inductive Logic Programming algorithm. Unlike traditional ILP systems that learn definite logic programs, ILASP learns answer set programs. ILASP is a powerful and general framework for learning answer set programs; it is able to learn choice rules, constraints, and even preferences over answer sets [97].
ILASP is able to solve some simple apperception tasks. For example, ILASP is able to solve the task in Example 6. But for the ECA tasks, the music and rhythm tasks, and the Seek Whence tasks, the ASP programs generated by ILASP were not solvable because they required too much memory.
In order to understand the memory requirements of ILASP on these tasks, and to compare our system with ILASP in a fair like-for-like manner, we looked at the size of the grounded ASP programs. Recall that both our system and ILASP generate ASP programs that are then grounded into propositional clauses which are passed to a SAT solver. The grounding size determines the memory usage and is strongly correlated with solution time.
We took a representative ECA rule, Rule 245, and looked at the grounding size as the number of cells increased from 2 to 11. We used the same template for both ILASP and the Apperception Engine. The results are in Fig. 9.
As we increase the number of cells, the grounding size of the ILASP program grows much faster than that of the corresponding Apperception Engine program. The reason for this marked difference is the different ways the two approaches represent rules. In our system, rules are interpreted by an interpreter that operates on reified representations of rules. In ILASP, by contrast, rules are compiled into ASP rules. This means that, if there are |U_φ| unground atoms and at most N_B atoms in the body of a rule, then ILASP will generate |U_φ|^{N_B + 1} different clauses. When it comes to grounding, if there are |φ| substitutions and t time-steps, then ILASP will generate at most |U_φ|^{N_B + 1} · |φ| · t ground instances of the generated clauses. Each ground instance will contain N_B + 1 atoms, so there are (N_B + 1) · |U_φ|^{N_B + 1} · |φ| · t ground atoms in total.

Compare this with our system. Here, we do not represent every possible rule explicitly as a separate clause. Rather, we represent the possible atoms in the body of a rule by an ASP choice rule:

0 { rule_body(R, VA) : is_unground_atom(VA) } k_max_body :- is_rule(R).

(Footnote: We compared against ILASP rather than Metagol (another state-of-the-art inductive logic programming system [91,92]) because (i) ILASP is comparable in performance (it achieved slightly better results than Metagol in the Inductive General Game Playing task suite [93], getting 40% correct as opposed to Metagol's 36%), and (ii) since ILASP also uses ASP, we can compare the grounding size of our program with ILASP and get a fair apples-for-apples comparison. We used ILASP rather than HEXMIL (the ASP implementation of Metagol [94]) because of scaling problems with HEXMIL [95]. Our decision to use ILASP rather than Metagol for these tests was based on a number of discussions with Andrew Cropper (the developer of Metagol) and Mark Law (the developer of ILASP). We are very grateful to both for their advice on this.)
(Footnote: Strictly speaking, ILASP is a family of algorithms, rather than a single algorithm. We used ILASP2 [97] in this evaluation. Answer set programming under the stable model semantics is distinguished from traditional logic programming in that it is purely declarative and each program has multiple solutions (known as answer sets). Because of its non-monotonicity, ASP is well suited for knowledge representation and commonsense reasoning [100,101].)
If there are N_→ static rules and N_⊃ causal rules, then this choice rule only generates N_→ + N_⊃ ground clauses, each containing |U_φ| atoms.
The most expensive clauses in our encoding are analysed in Table 3. Recall from Section 4.4 that the total number of atoms in the ground clauses is approximately 5 · (N_→ + N_⊃) · |U_φ| · |φ| · t. To compare this with ILASP, let us set N_B = 4 (which is representative). Then ILASP generates ground clauses with 5 · |U_φ|^5 · |φ| · t ground atoms, while our system generates clauses with 5 · (N_→ + N_⊃) · |U_φ| · |φ| · t. The reason, then, why our system has such lower grounding sizes than ILASP is that (N_→ + N_⊃) ≪ |U_φ|^4. Intuitively, the key difference is that ILASP considers every possible subset of the hypothesis space, while our system (by restricting to at most N_→ + N_⊃ rules) only considers subsets of size at most N_→ + N_⊃.
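The gap can be made concrete with some assumed (purely illustrative) sizes:

```python
# Illustrative sizes, not measured values from our experiments.
U    = 50     # |U_phi|: number of unground atoms
subs = 100    # |phi|:   number of substitutions
t    = 10     # number of time-steps
NB   = 4      # max atoms in a rule body
R    = 8      # N_arrow + N_cause: max number of rules

ilasp_atoms = (NB + 1) * U ** (NB + 1) * subs * t  # 5 * |U|^5 * |phi| * t
ours_atoms  = (NB + 1) * R * U * subs * t          # 5 * R * |U| * |phi| * t
ratio = ilasp_atoms // ours_atoms                  # = |U|^NB / R
```

Even for these modest sizes, the ILASP grounding is hundreds of thousands of times larger.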

Noisy apperception
So far, we have assumed that our sensor readings are entirely noise-free: some of the sensory readings may be missing, but none of the readings are inaccurate.
If we give the Apperception Engine a sensory sequence with mislabelled data, it will struggle to provide a theoretical explanation of the mislabelled input. Consider, for example, a sequence S_{1:20} consisting of p's with a single anomalous q(a). If we give sequences such as this to the Apperception Engine, it attempts to make sense of all the input, including the anomalies. In this case, it finds a baroque explanation in which it introduces three new invented predicates c_1, c_2, c_3 in order to count how many p's it has seen, so that it knows when to switch to q. If we move the anomalous entry q(a) later in the sequence, or add further anomalies, the engine is forced to construct increasingly complex theories. This is clearly unsatisfactory. In order to handle noisy mislabelled data, we shall relax our insistence that the sequence S_{1:T} is entirely covered by the trace of the theory θ. Instead of insisting that S ⊑ τ(θ), we shall minimise the number of discrepancies between each S_i and τ(θ)_i, for i = 1..T.
We want to find the most probable theory θ given our noisy input sequence S_{1:T}:

arg max_θ p(θ | S_{1:T})

By Bayes' rule, this is equivalent to:

arg max_θ p(S_{1:T} | θ) · p(θ) / p(S_{1:T})

Since the denominator does not depend on θ, this is equivalent to:

arg max_θ p(S_{1:T} | θ) · p(θ)

Since the probability of the state S_i is conditionally independent of the previous state S_{i-1} given θ (this is the assumption behind the Hidden Markov Model), the above is equivalent to:

arg max_θ p(θ) · ∏_{i=1}^{T} p(S_i | θ)

Now each S_i depends only on τ(θ)_i, the trace of θ at time step i. Thus we can rewrite this as:

arg max_θ p(θ) · ∏_{i=1}^{T} p(S_i | τ(θ)_i)

Since we assume a description-length prior over theories, p(θ) ∝ 2^{-len(θ)}. Let the probability of S_i given τ(θ)_i be proportional to 2^{-d(S_i, τ(θ)_i)}, where d counts the discrepancies between S_i and τ(θ)_i. Thus, we define the cost_noise of the theory to be:

cost_noise(θ) = len(θ) + Σ_{i=1}^{T} d(S_i, τ(θ)_i)

and search for the theory with lowest cost.
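Representing each S_i and each τ(θ)_i as a set of ground atoms, and counting discrepancies as the size of the symmetric difference (one way to realise d above), the cost can be computed as follows (an illustrative sketch of the objective, not the ASP implementation):

```python
def cost_noise(theory_len, sequence, trace):
    # len(theta) plus the number of discrepancies between each observed
    # state S_i and the corresponding trace state tau(theta)_i.
    return theory_len + sum(len(s ^ tr) for s, tr in zip(sequence, trace))
```

A degenerate empty theory pays no description-length cost but one unit per unexplained atom, so it wins only on short sequences.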
Example 12. Consider, for example, a short sequence S_{1:10}. Because the sequence is so short, the lowest cost_noise theory is the degenerate empty theory, which has cost 14 (the number of atoms in S), shorter than any "proper" explanation that captures the regularities. But as the sequence gets longer, the advantage of a "proper" explanation over a degenerate solution becomes more and more apparent: consider, for example, an extension S_{1:30}. We can see, then, that the noise-robust version of the Apperception Engine is somewhat less data-efficient than the noise-intolerant version described earlier.

Experiments
We used the following sequences to compare the noise-intolerant Apperception Engine with the noise-robust version:

a,a,b,b,a,a,b,b,a,a,b,b, ...
a,a,a,b,a,a,a,b,a,a,a,b, ...
a,b,b,a,a,b,b,a,a,b,b,a, ...
a,b,c,a,b,c,a,b,c,a,b,c, ...
a,b,c,b,a,a,b,c,b,a,a,b,c,b,a, ...
a,b,a,c,a,b,a,c,a,b,a,c, ...
a,b,c,c,a,b,c,c,a,b,c,c, ...
a,a,b,b,c,c,a,a,b,b,c,c, ...

We chose these particular sequences because they are simple, noise-free, and the Apperception Engine is able to solve them in a reasonably short time.
We performed two groups of experiments. In the first, we evaluated how much longer the sequence needs to be for the noise-robust version to capture the underlying regularity, in comparison with the noise-intolerant version, which is more data-efficient. Fig. 10 shows the results. We plot mean percentage accuracy (over the ten sequences) against the length of the sequence that is provided to the Apperception Engine. Note that the noise-intolerant version only needs sequences of length 10 to achieve 100% accuracy, while the noise-tolerant version needs sequences of length 30.

In the second experiment, we evaluate how much better the noise-robust version of the Apperception Engine is at handling mislabelled data. We take the same ten sequences above, extended to length 100, and consider various perturbations of the sequence where we randomly mislabel a certain number of entries. Fig. 11 shows the results. We plot mean percentage accuracy (over the ten sequences) against the percentage of mislabellings. Note that the noise-intolerant version deteriorates to random as soon as any noise is introduced, while the noise-robust version is able to maintain reasonable accuracy (88%) with up to 30% of the sequence mislabelled.
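The perturbation procedure can be sketched as follows (illustrative; the sampling details are our own):

```python
import random

def mislabel(seq, k, alphabet, seed=0):
    # Corrupt exactly k randomly-chosen entries, replacing each with a
    # different symbol from the alphabet.
    rng = random.Random(seed)
    seq = list(seq)
    for i in rng.sample(range(len(seq)), k):
        seq[i] = rng.choice([s for s in alphabet if s != seq[i]])
    return seq
```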

Related work
In this section, we describe particular systems that are related to our approach.For a general overview of the space of different approaches, see Section 1.1.

"Theory learning as stochastic search in a language of thought"
Ullman et al. [67,68] describe a system for learning first-order rules from symbolic data. Recasting their approach into our notation, their system is given as input a set S of ground atoms, and it searches for a set R of static rules and a set I of atoms such that R, I ⊨ S.
Of course, the task as just formulated admits of entirely trivial solutions: for example, let I = S and R = {}. The idea is that if S contains structural regularities, their system will find an R and I that are much simpler than the degenerate solution above. Consider, for example, the various surface relations in a family tree: John is the father of William; William is the husband of Anne; Anne is the mother of Judith; John is the grandfather of Judith. All the various surface relations (father, mother, husband, grandfather...) can be explained by a small number of core relations: parent(X, Y), spouse(X, Y), male(X), and female(X). Now the surface facts S = {father(john, william), ...} can be explained by a small number of facts involving core predicates I = {parent(john, william), male(john), ...} together with rules such as:

father(X, Y) ← parent(X, Y), male(X)

At the computational level, then, the task that Ullman et al. set out to solve is: given a set S of ground atoms featuring surface predicates, find the smallest set I of ground atoms featuring only core predicates, and the smallest set R of static rules, such that R, I ⊨ S. Recasting this task in the language of probability, they wish to find arg max_{R,I} p(R, I | S). At the algorithmic level, they use Markov Chain Monte Carlo (MCMC) search. When the algorithm is currently considering a search element x, it generates a candidate next element x′ by randomly perturbing x. Then it compares the scores of x and x′. If x′ is better, it switches attention to focus on x′. Otherwise, if x′ is worse than x, there is still a non-zero probability of switching (to avoid local optima), but the probability is lower when x′ is significantly worse than the current search element x.
In their algorithm, MCMC is applied at two levels. At the first level, a set R of rules is perturbed into R′ by adding or removing atoms from clauses, or by switching one predicate for another predicate with the same arity. At the second level, I is perturbed into I′ by changing the extension of the core predicates.
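A single Metropolis-style move of the kind described can be sketched as follows (an illustrative sketch, not Ullman et al.'s actual implementation):

```python
import math
import random

def mcmc_step(x, score, perturb, rng, temperature=1.0):
    # Propose a random perturbation; always accept an improvement, and
    # accept a worse candidate with a probability that shrinks the worse it is.
    x2 = perturb(x, rng)
    delta = score(x2) - score(x)
    if delta >= 0 or rng.random() < math.exp(delta / temperature):
        return x2
    return x
```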
Given that the search space of sets of rules is so enormous, and that MCMC is a stochastic search procedure that only operates locally, the algorithm needs additional guidance to find solutions. In their case, they provide a template: a set of meta-rules that constrain the types of rules that are generated in the outermost MCMC loop. A meta-rule is a higher-order clause in which the predicates are themselves variables. For example, in the following meta-rule for transitivity, P is a variable ranging over two-place predicates:

P(X, Z) ← P(X, Y), P(Y, Z)

Meta-rules are a key component in many logic program synthesis systems [91,102,103,96,99].
Ullman et al. tested their system in a number of domains, including taxonomy hierarchies, simplified magnetic theories, kinship relations, and psychological explanations of action. In each domain, their system is able to learn human-interpretable theories from small amounts of data.
At a high level, Ullman et al.'s system has much in common with the Apperception Engine. They are both systems for generating interpretable explanations from small quantities of symbolic data. While the Apperception Engine generates a (φ, I, R, C) tuple from a sequence (S_1, ..., S_T), their system generates an (I, R) pair from a single set S of atoms. But there are a number of significant differences. First, and most importantly, our system learns causal dynamics from time series, while their system only learns static rules. Second, our system posits latent objects as well as latent predicates, while their system only posits latent predicates. The ability to imagine unobserved objects, with unobserved attributes that explain the observed attributes of observed objects, is a key feature of the Apperception Engine. Third, their system requires hand-engineered templates in order to find a theory that explains the input. This reliance on hand-engineered templates restricts the domain of application of their technique: in a domain in which they do not know, in advance, the structure of the rules they want to learn, their system will not be applicable. Fourth, a unified interpretation θ = (φ, I, R, C) in our system includes a set C of constraints. These constraints play a critical role in our system: they are both regulative (ruling out certain incompossible combinations of atoms) and constitutive (the constraints determine the incompossibility relation that in turn grounds the frame axiom). There is no equivalent of our constraints C in their system. A fifth key difference is that our system has to produce a theory that, as well as explaining the sensory sequence, also satisfies the unity conditions: spatial unity, conceptual unity, static unity, and temporal unity. There is no analog of our unity conditions in Ullman et al.'s system.
At the algorithmic level, the systems are very different. While we use a form of meta-interpretive learning (see Section 4.2), they use MCMC. Our system compiles an apperception problem into the task of finding an answer set of an ASP program that minimises the program cost. The ASP problem is given to an ASP solver, which is guaranteed to find the global minimum. MCMC, by contrast, is a stochastic procedure that operates locally (moving from one point in program space to another), and is not guaranteed to find a global minimum (in practice, it rarely does).

"Learning from interpretation transition"
Inoue, Ribeiro, and Sakama [12] describe a system (LFIT) for learning logic programs from sequences of sets of ground atoms. Since their task definition is broadly similar to ours, we focus on the specific differences. In our formulation of the apperception task, we must construct a (φ, I, R, C) tuple from a sequence (S_1, ..., S_T) of sets of ground atoms. In their task formulation, they learn a set of causal rules from a set {(A_i, B_i)}_{i=1}^{N} of pairs of sets of ground atoms.
In some respects, their task formulation is more general than ours. First, their input {(A_i, B_i)}_{i=1}^{N} can represent transitions from multiple trajectories, rather than just a single trajectory, and corresponds to a generalized apperception task (see Definition 19). Second, they learn normal logic programs, allowing negation as failure in the body of a rule, while our system only learns definite clauses.
But there are a number of other ways in which our task formulation is significantly more general than LFIT's. First, our system posits latent information to explain the observed sequence, while LFIT does not construct any latent information. Their system searches for a program P that generates exactly the output state. In our approach, by contrast, we search for a program whose trace covers the output sequence, but need not be identical to it. The trace of a unified interpretation typically contains much extra information that is not part of the original input sequence, but that is used to explain the input information. Second, our system abduces a set of initial conditions as well as a set of rules, while LFIT does not construct initial conditions. Because of this, our system is able to predict the future, retrodict the past, and impute missing intermediate values. LFIT, by contrast, can only be used to predict future values. Third, our system generates a set of constraints as well as rules. The constraints perform double duty: on the one hand, they restrict the sets of compossible atoms that can appear in traces; on the other hand, they generate the incompossibility relation that grounds the frame axiom. Note that there is no frame axiom in LFIT.
In [12], Inoue et al. use a bottom-up synthesis method to learn rules. Given a state transition (A, B), they construct, for each β ∈ B, a most-specific normal ground rule with head β whose body describes the state A in full. Then, they use resolution to generalize the individual ground rules. It is important to note that this strategy is quite conservative in the generalizations it performs, since it only produces a more general rule if it turns out to be a resolvent of a pair of previous rules. While the Apperception Engine searches for the shortest (and hence most general) rules, LFIT searches for the most specific generalization. In more recent work [104], LFIT has been changed to perform top-down specialization, rather than bottom-up generalization. With this change, LFIT is guaranteed to find the shortest set of rules that explain the transitions.
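The bottom-up step can be sketched in a toy propositional form (a simplified reconstruction over a two-atom Herbrand base; the atom names, transitions, and merge condition below are our own illustration, not LFIT's actual algorithm):

```python
from itertools import combinations

# Hypothetical two-atom Herbrand base for the toy example.
ATOMS = ("p", "q")

def ground_rule(state, head):
    # Most-specific rule for `head`: the body records, for every atom in
    # the Herbrand base, whether it was true or false in the previous state.
    return (head, frozenset((a, a in state) for a in ATOMS))

def bottom_up(transitions):
    # One most-specific ground rule per atom true in the successor state.
    return {ground_rule(prev, head) for prev, nxt in transitions for head in nxt}

def generalise(rules):
    # Resolution-style merge: two rules with the same head whose bodies
    # differ only in the sign of a single atom are replaced by one rule
    # with that (irrelevant) literal dropped.
    rules = set(rules)
    changed = True
    while changed:
        changed = False
        for r1, r2 in combinations(list(rules), 2):
            (h1, b1), (h2, b2) = r1, r2
            diff = b1 ^ b2
            if h1 == h2 and len(diff) == 2 and len({a for a, _ in diff}) == 1:
                rules -= {r1, r2}
                rules.add((h1, b1 & b2))
                changed = True
                break
    return rules

transitions = [({"p"}, {"q"}), ({"p", "q"}, {"q"})]
rules = generalise(bottom_up(transitions))
# The two most-specific rules for q merge into the single rule q <- p.
```

The merge is conservative in exactly the sense described above: a literal is dropped only when two previously constructed rules agree on everything else.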
LFIT was tested on Boolean networks and on elementary cellular automata. It is instructive to compare our system with LFIT on the ECA tasks. When LFIT is applied to the ECA task, it is provided with the one-dimensional spatial relation between the cells as background knowledge. In our approach, by contrast, we do not hand-code the spatial relation, but rather let the Apperception Engine generate the spatial relation itself as part of the initial conditions (see Section 5.1.2). It is precisely because our system is able to posit latent information to explain the surface features that it is able to generate the spatial relation itself, rather than having to be given it.
In some situations, positing latent information allows us to construct a simpler theory. In other situations, however, positing latent information is absolutely essential to making sense of the sequence. There are many apperception tasks for which every interpretation that makes sense of the sequence must include latent information: consider, for example, learning the dynamics of the ECA without spatial information (Section 5.1.2), the Seek Whence sequences (Section 5.1.4), or the binding tasks (Section 5.1.6). Even the following simple example shows the unavoidable need to posit latent information. Consider the sequence S = (S_1, S_2, S_3) with S_1 = {p(a)}, S_2 = {p(a)}, and S_3 = {q(a)}, where φ contains one type t, one object a of type t, two unary predicates p and q, and one constraint ∀X:t, p(X) ⊕ q(X). Since p and q are incompatible, the transition from p to q between states S_2 and S_3 must be explained by a causal rule ψ ⊃- q(X), where ψ is a set of atoms. Now ψ cannot be empty, or the rule would be unsafe. ψ cannot be {p(X)}, or else q(a) would be derivable at the second time-step, contradicting S_2 = {p(a)}. Similarly, ψ cannot be {q(X)}, or else q(a) would not be derivable at time-step 3. Hence ψ must contain an atom featuring a predicate distinct from p and q, and so every interpretation that makes sense of S must invoke latent information.
One of the theories found by the Apperception Engine for this is θ = (φ, I, R, C ), where: Here, there are two latent predicates, r and s, that are used as counters, so the system can distinguish between the two occurrences of p(a) in S 1 and S 2 .Thus for some sequences, the positing of latent predicates, and the abduction of initial conditions for the latent atoms, is unavoidable. 39FIT has been extended in a number of ways, to increase the range of real-world problems that it can tackle.In [106], LFIT was extended to learn probabilistic models.In [105], the system was extended from the Markov( 1) assumption (where the new state depends only on the current state) to the more general Markov(k) setting (where the new state depends on the last k states).In [107], LFIT was generalised so that as well as working with deterministic models (where all state transitions happen simultaneously), it also can work with other semantics (where only a subset of the transitions may happen at each time-step).In [108], LFIT was extended to work directly with continuous sensor data, rather than assuming the continuous sensor data has first been discretised by some other process.In [109,110], LFIT was reimplemented in a feed-forward neural network, so as to robustly handle noisy and continuous data.
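The impossibility argument above can be checked mechanically. The following sketch (a toy simulator of our own, not the engine's typed Datalog semantics) brute-forces every set of causal rules built from the surface predicates p and q alone, and confirms that none of them reproduces the sequence S_1 = {p(a)}, S_2 = {p(a)}, S_3 = {q(a)}: the dynamics are a deterministic function of the state, yet S_1 = S_2 while S_2 ≠ S_3.

```python
from itertools import chain, combinations

# With a single object a, a state is just the subset of {p, q} holding of a.
SEQ = [frozenset({"p"}), frozenset({"p"}), frozenset({"q"})]

# Candidate rules over surface predicates only: a nonempty body (empty
# bodies are unsafe) and a single head.
BODIES = [frozenset({"p"}), frozenset({"q"}), frozenset({"p", "q"})]
CANDIDATE_RULES = [(body, head) for body in BODIES for head in ("p", "q")]

def powerset(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

def step(state, rules):
    # Fire every matching rule; the constraint p XOR q forbids deriving
    # both; the frame axiom keeps the old atom if nothing is derived.
    derived = {head for body, head in rules if body <= state}
    if len(derived) > 1:
        return None                      # p(a) and q(a) are incompossible
    return frozenset(derived) if derived else state

def explains(rules):
    state = SEQ[0]
    for expected in SEQ[1:]:
        state = step(state, rules)
        if state is None or state != expected:
            return False
    return True

# No rule set over p and q alone explains the sequence.
assert not any(explains(frozenset(rs)) for rs in powerset(CANDIDATE_RULES))
```

Adding a latent counter predicate (as the engine's theory does with r and s) breaks the symmetry between the two p(a) states, which is exactly why latent information is unavoidable here.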

"Unsupervised learning by program synthesis"
Ellis et al. [22] use program synthesis to solve an unsupervised learning problem. Given an unlabelled dataset {x_i}_{i=1}^{N}, they find a program f and a set of inputs {I_i}_{i=1}^{N} such that f(I_i) is close to x_i for each i = 1..N. More precisely, they use Bayesian inference to find the f and {I_i}_{i=1}^{N} that minimize the total description length (negative log probability) of the program, the initial conditions, and the data-reconstruction error:

$-\log P_f(f) + \sum_{i=1}^{N} \left[ -\log P_I(I_i) - \log P_{x|z}(x_i \mid f(I_i)) \right]$

where P_f(f) is a description length prior over programs, P_I(I_i) is a description length prior over initial conditions, and P_{x|z}(· | z_i) is a noise model. This system was designed from the outset to be robust to noise, using Bayesian inference to calculate the desired tradeoff between the program length, the initial conditions length, and the data-reconstruction error cost. They tested this system in two domains: reproducing two-dimensional pictures, and learning morphological rules for English verbs.
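The description-length objective can be sketched directly. The scoring function below follows the decomposition above, but the particular priors, noise model, data, and candidate programs are illustrative assumptions of ours, not Ellis et al.'s:

```python
import math

def description_length(program_prior, input_prior, noise_model, f, inputs, data):
    # DL(program) plus, for each example, DL(initial conditions) plus the
    # reconstruction error, each a negative log probability in bits.
    total = -math.log2(program_prior(f))
    for I, x in zip(inputs, data):
        total += -math.log2(input_prior(I))
        total += -math.log2(noise_model(x, f(I)))
    return total

# Toy instantiation (all of the following are illustrative assumptions).
program_prior = lambda f: 0.5            # uniform over two candidate programs
input_prior = lambda I: 1 / 10           # inputs drawn uniformly from 0..9
noise_model = lambda x, z: 0.9 if x == z else 0.1 / 9

data = [3, 5, 8]
f_good = lambda I: I + 1                 # reconstructs every x_i from I_i = x_i - 1
f_bad = lambda I: I                      # identity: every reconstruction misses

cost_good = description_length(program_prior, input_prior, noise_model,
                               f_good, [2, 4, 7], data)
cost_bad = description_length(program_prior, input_prior, noise_model,
                              f_bad, [2, 4, 7], data)
```

Because the noise model assigns most of its mass to exact reconstruction, the program that reproduces the data exactly gets the lower total description length, which is the tradeoff the Bayesian formulation is computing.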
This system is similar to ours in that it produces interpretable programs from a small number of data samples. Like ours, their program length prior acts as an inductive bias that prefers general solutions over special-case memorized solutions. Like ours, as well as constructing a program, they also learn initial conditions that combine with the program to produce the desired results. At a high level, their algorithm is also similar: they generate a Sketch program [111] from the dataset {x_i}_{i=1}^{N} of examples, and use an SMT solver to fill in the holes. They then extract a readable program from the SMT solution, which they then apply to new instances, exhibiting strong generalization.
As well as the high-level architectural similarities, there are a number of important differences. First, their goal is to generate an object f(I_i) that matches the input object x_i as closely as possible. Our goal is more general: we seek to generate a sequence τ(θ) that covers the input sequence. The covering relation is much more general, as S_i only has to be a subset of τ(θ)_i, not identical to it. This allows the addition of latent information to the trace of the theory. A second key difference is that we focus on generating sequences, not individual objects. Our system is designed for making sense (unsupervisedly) of time series, i.e. sequences of states, not for reconstructing individual objects. A third key difference is that we use a single domain-independent language, Datalog⊃-, for all domains, while Ellis et al. use a different domain-specific imperative language for each domain they consider. A fourth key difference is that we use a declarative language, rather than an imperative language. An individual rule or constraint has a truth-conditional interpretation, and can be interpreted as a belief of the synthesising agent. An individual line of an imperative procedure, by contrast, cannot be interpreted as a belief. A fifth major difference is that we synthesise constraints as well as rules. Constraints are the "special sauce" of our system: exclusive disjunctions combine predicates into groups, enforce that each state is fully determinate, and ground the incompossibility relation that underlies the frame axiom.

"Learning symbolic models of stochastic domains"
Pasula et al. [10] describe a system for learning a state transition model from data. The model learns a probability distribution p(s′ | s, a), where s is the previous state, a is the action that was performed, and s′ is the next state.
Each state is represented as a set of ground atoms, just as in our system. They assume complete observability: they are given the value of every sensor, and the task is just to predict the next values of the sensors.
They represent a state transition model by a set of "dynamic rules": first-order clauses determining the future state given a current state and an action. These dynamic rules are very close to the causal rules in Datalog⊃-. Unlike in our system, their rules attach a probability to each possible outcome. Note that their system does not include static rules or constraints.
In their semantics, they assume that exactly one dynamic rule fires every time-step. This is a very strong assumption, but it makes it easier to learn rules with probabilistic outcomes.
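The exactly-one-rule-fires semantics can be sketched as follows. The two rules, their preconditions, and their probabilities are toy stand-ins of our own (loosely modelled on the gripper domain discussed below), not Pasula et al.'s learned rules:

```python
import random

# A "dynamic rule" is (precondition, action, distribution over outcomes).
# The semantics assume exactly one rule matches each (state, action) pair.
RULES = [
    (lambda s: "holding" not in s, "pickup",
     [(0.8, {"holding"}), (0.2, set())]),        # the gripper sometimes fails
    (lambda s: "holding" in s, "putdown",
     [(0.95, set()), (0.05, {"holding"})]),
]

def transition(state, action, rng):
    matching = [r for r in RULES if r[0](state) and r[1] == action]
    assert len(matching) == 1, "the semantics assume exactly one rule fires"
    _, _, outcomes = matching[0]
    roll, acc = rng.random(), 0.0
    for prob, effect in outcomes:        # sample one outcome of the rule
        acc += prob
        if roll < acc:
            return effect
    return outcomes[-1][1]

rng = random.Random(0)
successes = sum("holding" in transition(set(), "pickup", rng)
                for _ in range(1000))    # roughly 800 of 1000 pickups succeed
```

Because only one rule can match, the distribution over next states is just that rule's outcome distribution, which is what makes the probabilistic outcomes learnable; the cost, noted below, is that simultaneous local updates (as in a cellular automaton) cannot be expressed.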
They learn state transitions for the noisy gripper domain (where a robot hand is stacking bricks, and sometimes fails to pick up what it attempts to pick up) and a logistics problem (involving trucks transporting objects from one location to another). Impressively, they are able to learn probabilistic rules in noisy settings. They also verify the usefulness of their learned models by passing them to a planner (a sparse sampling MDP planner), and show, reassuringly, that the agent achieves more reward with a more accurate model.
At a strategic level, their system is similar in approach to ours. First, they learn first-order rules, not merely propositional ones. In fact, they show in ablation studies that propositional rules generalise significantly less well, as one would expect. Second, they use an inductive bias against constants (p. 14), just as we do: "learning action models which are restricted to be free of constants provides a useful bias that can improve generalisation when training with small data sets". Third, their system is able to construct new invented predicates.
But there are also a number of differences. In our system, many rules can fire simultaneously, but in theirs, only one rule can fire in any state. Because of this assumption, they cannot model, e.g., a cellular automaton, where each cell has its own individual update rule firing simultaneously. Another limiting assumption is that they assume complete observability of all sensory predicates. This means they would not be able to solve, e.g., occlusion tasks.

"Nonmonotonic abductive inductive learning"
Ray [11] describes a system, XHAIL, for jointly learning to abduce ground atoms and induce first-order rules. XHAIL learns normal logic programs that can include negation as failure in the body of a rule.
XHAIL is similar to the Apperception Engine in that, as well as inducing general first-order rules, it also constructs a set of initial ground atoms. This enables it to model latent (unobserved) information, which is a very powerful and useful feature. At the implementation level, it uses a similar strategy, in that solutions are found by iterative deepening over a series of increasingly complex ASP programs. The simplified event calculus [54] is represented explicitly as background knowledge.
But there are also a number of key differences. First, XHAIL does not model constraints. This means it is not able to represent the incompossibility relation between ground atoms. Nor does XHAIL try to satisfy our other unity conditions, such as spatial and conceptual unity. Second, the induced rules are compiled in XHAIL, rather than being interpreted (as in our system). Representing each candidate induced rule explicitly as a separate ASP rule means that the number of ASP rules considered grows exponentially with the size of the rule body. (XHAIL shares this implementation strategy with ASPAL [112] and ILASP [96]; see Section 5.2.4 for discussion of the grounding problem associated with this family of approaches. The discussion there is specifically focused on ILASP, but we believe the same issue affects ASPAL and XHAIL mutatis mutandis.) Third, XHAIL needs to be provided with a set of mode declarations to limit the search space of possible induced rules. These mode declarations constitute a significant piece of background knowledge. Now of course there is nothing wrong with allowing an ILP system to take advantage of background knowledge to aid the search. But when an ILP system relies on this hand-engineered knowledge, it restricts the range of applicability to domains in which human engineers can anticipate in advance the form of the rules they want the system to learn. (See Appendix C of [62] for a discussion of the use of mode declarations as a language bias in ILP systems.)

"The Game Description Language and Datalog⊃-"
Our language Datalog⊃- is an extension of Datalog that incorporates, as well as the standard static rules of Datalog, both causal rules (Definition 7) and constraints (Definition 8). The semantics of Datalog⊃- are defined in Definition 9. Unlike standard Datalog, the atoms and rules of Datalog⊃- are strongly typed (see Definitions 4, 5, and 7).

At a high level, Datalog⊃- is related to the Game Description Language (GDL) [113]. The GDL is an extension of Datalog that was designed to express deterministic multi-agent discrete Markov decision processes. The GDL includes (stratified) negation by failure, as well as some (restricted) use of function symbols, but these extensions were carefully designed to preserve the key Datalog property that a program has a unique subset-minimal Herbrand model. The GDL includes special keywords, including init for specifying initial conditions (equivalent to the initial conditions I in a (φ, I, R, C) theory) and next for specifying state transitions (equivalent to our causal rules). The inductive general game playing (IGGP) task [114,93] involves learning the rules of a game from observing traces of play.
An IGGP task is broadly similar to an apperception task in that both involve inducing initial conditions and rules from traces. But there are many key differences. One major feature of Datalog⊃- is the use of constraints to generate incompossible sets of ground atoms. These exclusion constraints are needed to generate the incompossibility relation, which in turn is needed to restrict the scope of the frame axiom (see Definition 9).
The main difference between Datalog⊃- and the GDL is that the former includes exclusion constraints. The exclusion constraints play two essential roles. First, they enable the theory as a whole to satisfy the condition of conceptual unity. Second, they provide constraints, via the condition of static unity, on the generated trace: since the constraints must always be satisfied, this restricts the rules that can be constructed. Satisfying these constraints means filling in missing information. This is why a unified interpretation is able to make sense of incomplete traces where some of the sensory data is missing.
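The interaction between causal rules, exclusion constraints, and the frame axiom can be sketched in a propositional toy (our own illustration, not the typed Datalog⊃- semantics): an atom persists by inertia unless a newly derived atom shares an exclusion group with it.

```python
# Exclusion constraints: exactly one of {on, off} holds of each object.
EXCLUSIVE = [{"on(a)", "off(a)"}, {"on(b)", "off(b)"}]

# Toy causal rules (body, head): a toggles each step; b copies a's old value.
CAUSAL_RULES = [
    ({"on(a)"}, "off(a)"),
    ({"off(a)"}, "on(a)"),
    ({"on(a)"}, "on(b)"),
    ({"off(a)"}, "off(b)"),
]

def incompossible(atom, new_atoms):
    # An old atom is blocked exactly when a newly derived atom shares an
    # exclusion group with it.
    return any(atom in group and (group - {atom}) & new_atoms
               for group in EXCLUSIVE)

def step(state):
    derived = {head for body, head in CAUSAL_RULES if body <= state}
    inertia = {a for a in state if not incompossible(a, derived)}  # frame axiom
    return derived | inertia

def make_trace(initial, length):
    states = [initial]
    for _ in range(length - 1):
        states.append(step(states[-1]))
    return states

ts = make_trace({"on(a)", "off(b)"}, 4)
# The trace oscillates: {on(a), off(b)}, {off(a), on(b)}, {on(a), off(b)}, ...
```

Without the exclusion groups, the frame axiom would carry on(a) forward even after off(a) is derived; it is the constraints that determine which old atoms the inertia rule must drop.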

Related application areas
We briefly outline four related research areas where the Apperception Engine can be applied. One related area of research is learning action theories [115-119]. Here, the aim is to learn the transition dynamics in the presence of exogenous actions performed by an agent. The aim is not to predict what actions the agent performs, but rather to predict the effects of the actions on the state.
Another application area is relational reinforcement learning [120-122]. Here, the agent works out how to optimize its reward in an environment by constructing (using ILP) a first-order model of the dynamics of that environment, which it then uses to plan. The Apperception Engine can be used to construct the dynamics model.
A related application area is learning game rules from player traces [123,96]. Here, the learning system is presented with traces (typically, sequences of sets of ground atoms) representing the state of the game at various points in time, and has to learn the transition dynamics (or reward function, or action legality function) of the underlying system.
A fourth related area is the predictive processing (PP) paradigm [124,6-8], an increasingly popular model in computational and cognitive neuroscience. Inspired by Helmholtz, the model learns to make sense of its sensory stream by attempting to predict future percepts. When the predicted percepts diverge from the actual percepts, the model updates its parameters to minimize prediction error. The PP model is probabilistic, Bayesian, and hierarchical: probabilistic in that the predicted sensory readings are represented as probability density functions, Bayesian in that the likelihood is combined with prior expectations [6], and hierarchical in that each layer provides predictions of sensory input for the layer below; there are typically many layers. While PP focuses on prediction, the Apperception Engine generates an interpretation that is equally adept at predicting future signals, retrodicting past signals, and imputing missing intermediary signals. In our approach, the ability to predict future signals is a derived capacity, one that emerges from the more general capacity to construct a unified interpretation; prediction is not singled out in particular. The Apperception Engine is able to predict, retrodict, and impute; in fact, it is able to do all three simultaneously, using a single incomplete sensory sequence with elements missing at the beginning, at the end, and in the middle.

Summary
In summary, there are various other systems that construct dynamic rules for explaining sequences. But these systems are unable to posit latent hidden information to make sense of the sequence. They are able to predict future elements of the sequence, but are not able to retrodict earlier elements, or impute missing intermediary elements. For problems that require positing latent hidden information, or problems that require retrodiction and imputation as well as prediction, the Apperception Engine is particularly well suited.

Conclusion
This paper is an attempt to answer a key question of unsupervised learning: what does it mean to "make sense" of a sensory sequence? Our answer is that making sense means constructing a symbolic theory containing a set of objects that persist over time, with attributes that change over time, according to general laws. This theory must both explain the sensory input and satisfy the unity conditions of Section 3.3. As well as providing a precise formalization of this task, we also provide a concrete implementation of a system that is able to make sense of the sensory stream. We have tested the Apperception Engine in a variety of domains; in each domain, we tested its ability to predict future values, retrodict previous values, and impute missing intermediate values. Our system achieves good results across the board, outperforming neural network baselines and also state-of-the-art ILP systems.
Of particular note is that the Apperception Engine is able to achieve human-level performance on challenging sequence induction intelligence tests. We stress, once more, that the system was not hard-coded to solve these tasks. Rather, it is a general domain-independent sense-making system that is able to apply its general architecture to the particular problem of Seek Whence induction tasks, and is able to solve these problems "out of the box" without human hand-engineered help. We also stress, again, that the system did not learn to solve these sequence induction tasks by being presented with hundreds of training examples. Indeed, the system had never seen a single such task before. Instead, it applied its general sense-making urge to each individual task, de novo.
Our architecture, an unsupervised program synthesis system, is a purely symbolic system, and as such it inherits two key advantages of ILP systems [62]. First, the interpretations produced are interpretable: because the output is symbolic, it can be read and verified by a human. Second, it is very data-efficient: because of the language bias of the Datalog⊃- language, and the strong inductive bias provided by the unity conditions, the system is able to make sense of extremely short sequences of sensory data, without having seen any others.

However, the system in its current form has some limitations that we wish to make explicit. First, the sensory input must be discretized before it can be passed to the system. We assume some prior system has already discretized the continuous sensory values by grouping them into buckets. One possible approach to dealing with continuous sensory values is to combine the Apperception Engine with a neural network that maps the raw continuous inputs into categories. We have recently developed such an extension, in which we discretize the input by simulating a binary neural network. The binary neural network is implemented in ASP, so the weights of the network and the rules of the theory can be found simultaneously by solving one large SAT problem.
Second, our implementation as described above assumes all causal rules are fully deterministic. It is quite straightforward to add non-determinism to the Datalog⊃- framework: we can define an extended theory as a theory with initial conditions for each time-step (rather than only allowing initial conditions for the first time-step, as in Definition 2). An extended theory θ = (φ, {I_1, ..., I_T}, R, C) generates a trace τ(θ) = (A_1, A_2, ...) in exactly the same way as in Definition 9, with one small exception: I_t ⊆ A_t replaces I ⊆ A_1. In other words, new atoms can be abduced at each time-step. This allows us to handle non-determinism by abducing atoms that change their truth-value according to {I_1, ..., I_T} instead of according to the rules in R.
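The proposed extension can be sketched as follows. The step function is a toy deterministic dynamics of our own devising; the point is only how the per-time-step abduced atoms I_t are injected into the trace, so that I_t ⊆ A_t holds at every step.

```python
def step(state):
    # Toy deterministic rules: "rain" causes "wet"; "wet" persists by
    # inertia, while "rain" does not (it must be abduced to recur).
    nxt = {a for a in state if a != "rain"}
    if "rain" in state:
        nxt.add("wet")
    return nxt

def extended_trace(I_seq):
    # A_1 = I_1, and A_{t+1} = step(A_t) | I_{t+1}: new atoms can be
    # abduced at each time-step, which is where non-determinism enters.
    states = []
    for I_t in I_seq:
        state = step(states[-1]) if states else set()
        states.append(state | I_t)
    return states

history = extended_trace([{"rain"}, set(), {"rain"}, set()])
# history == [{"rain"}, {"wet"}, {"rain", "wet"}, {"wet"}]
```

The abduced atoms {I_1, ..., I_T} play the role of exogenous choices: the rules R remain deterministic, and all the non-determinism is localised in which atoms are injected at each step.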
Third, the size of the search space means that our system is currently restricted to small-to-medium-size problems. Going forward, we believe that the right way to build complex theories is incrementally, using curriculum learning: the system should consolidate what it learns in one episode, storing it as background knowledge, and reusing it in subsequent episodes. We hope in future work to address these limitations. But we believe that, even in its current form, the Apperception Engine shows considerable promise as a prototype of what a general-purpose domain-independent sense-making machine must look like.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Example 4. The theory θ of Example 3 satisfies the four unity conditions since: 1. For each state A_i in τ(θ), a is connected to b via the singleton chain {r(a, b)}, and b is connected to a via {r(b, a)}.
We can use the trace to predict the future values of our two sensors at time step 10, since A_10 = {on(a), on(b), r(a, b), r(b, a), p_2(a), p_1(b)}.

Fig. 1. The varieties of inference. Here, shaded elements are given, and unshaded elements must be generated. X and Y are sets of facts, while P is a set of rules for transforming X into Y.

Fig. 3. One trajectory for Rule 110. Each row represents the state of the ECA at one time-step. In this prediction task, the bottom row (representing the final time-step) is held out.

Fig. 4. Sequences from Seek Whence and the C-test.

Fig. 5. A multi-modal trace of ECA rule 110 with eleven light sensors (left) l_1, ..., l_11 and two touch sensors (right) t_1, t_2 attached to cells 3 and 11. Each row represents the states of the sensors for one time-step. For this prediction task, the final time-step is held out.

Fig. 7. Comparison with baselines. We display predictive accuracy on the held-out final time-step.

Fig. 10. Comparing the data-efficiency of the noise-robust version of the Apperception Engine with the noise-intolerant version. We plot mean percentage accuracy against the length of the sequence. The noise-intolerant version achieves 100% accuracy when the sequence is of length 10 or over, while the noise-robust version only achieves this level of accuracy when the length is over 30.

Fig. 11. Comparing the accuracy of the noise-robust version of the Apperception Engine with the noise-intolerant version. We plot mean percentage accuracy against the number of mislabellings. The noise-intolerant version deteriorates to random as soon as any noise is introduced, while the noise-robust version is able to maintain reasonable (88%) accuracy with up to 30% of the sequence mislabelled.
The aim is to find arg max_{R,I} p(R, I | S). Using Bayes' rule this can be recast as:

$\arg\max_{R,I} p(R, I \mid S) = \arg\max_{R,I} \frac{p(S \mid R, I)\, p(R, I)}{p(S)} = \arg\max_{R,I} p(S \mid R, I)\, p(R, I) = \arg\max_{R,I} p(S \mid R, I)\, p(R)\, p(I \mid R)$

Here, the likelihood p(S | R, I) is the proportion of S that is entailed by R and I, the prior p(R) is the size of the rules, and p(I | R) is the size of I. At the algorithmic level, Ullman et al. apply Markov Chain Monte Carlo (MCMC). MCMC is a stochastic search procedure.

Definition 12. A theory θ satisfies spatial unity if, for each state A_t in τ(θ) = (A_1, A_2, ...), and for each pair (x, y) of distinct objects, x and y are connected via a chain of binary atoms.

It is straightforward to check that A_1, A_2, and A_3 satisfy each constraint in C. Observe that A_4 repeats A_1, thus Theorem 1 ensures that we do not need to check any more time-steps. 4. Temporal unity is automatically satisfied by the definition of the trace τ(θ) in Definition 9.

Table 1. Enumerating (t, n) pairs. Row t means that there are t types in T, while column n means there are n tuples of the form (O, P, V, N_→, N_⊃-, N_B) to enumerate. We increment n by 100. The entries in the table represent the order in which the (t, n) pairs are visited.

Once we have a (T, n) pair, we need to emit n (O, P, V, N_→, N_⊃-, N_B) tuples using the types in T. One way of enumerating k-tuples, where k > 2, is to use the diagonalization technique recursively: first enumerate pairs, then apply the diagonalization technique to enumerate pairs consisting of individual elements paired with pairs, and so on. But this recursive application will result in heavy biases towards certain k-tuples. Instead, we use the Haskell function Universe.Helpers.choices to enumerate n-tuples while minimizing bias.

Table 2. The number of ground clauses in the ASP encoding of Algorithm 2.
Updates for ECA rule 110. The top row shows the context: the target cell together with its left and right neighbour. The bottom row shows the new value of the target cell given the context. A cell is black if it is on and white if it is off.

Table 5. Ablation experiments. We display predictive accuracy on the final held-out time-step.
Ullman et al. rule out such trivial solutions by adding two restrictions. First, they distinguish between two disjoint sets of predicates: the surface predicates are the predicates that appear in the input S, while the core predicates are the latent predicates. Only core predicates are allowed to appear in the initial conditions I. This distinction rules out the trivial solution above, but there are other degenerate solutions: for each surface predicate p, add a new core predicate p_c. If p(k_1, ..., k_n) is in S, add p_c(k_1, ..., k_n) to I. Also, add the rule p(X_1, ..., X_n) ← p_c(X_1, ..., X_n) to R. Clearly, R, I |= S, but this solution is unilluminating, to say the least. To prevent such degenerate solutions, the second restriction that Ullman et al. add is to prefer shorter rule-sets R and smaller sets I of initial atoms.