Computable Model Discovery and High-Level-Programming Approximations to Algorithmic Complexity

Motivated by algorithmic information theory, the problem of program discovery can help find candidates of underlying generative mechanisms of natural and artificial phenomena. The uncomputability of such inverse problem, however, significantly restricts a wider application of exhaustive methods. Here we present a proof of concept of an approach based on IMP, a high-level imperative programming language. Its main advantage is that conceptually complex computational routines are more succinctly expressed, unlike lower-level models such as Turing machines or cellular automata. We investigate if a more expressive higher-level programming language can be more efficient at generating approximations to algorithmic complexity of recursive functions, often of particular mathematical interest.


Introduction
Since first proposed as a measure of complexity [12], approximations to, or measures inspired by, Kolmogorov-Chaitin or algorithmic complexity, have found many applications [13].But at its core, as an semi-computable measure, it can only be applied by indirect methods such as computable weaker versions based on statistical lossless compressibility [13,7] or computationally expensive approximations [18,19].Statistical compressibility is limited to what entropic measures can find not beyond trivial regularities and redundancies without resort to uncomputability [22].Alternative approximations by means of exhaustive searches are also very expensive [15,20] and are limited to small programs.For very small Turing machines this can be avoided through clever tricks that cannot be expanded to larger machines [15] and often produce the kind of programs that humans would not find very interesting such as the ones generating highly recursive objects such as arithmetic or geometric progressions that mathematicians find of high interest [8].This kind of approach has inspired a whole line of research into computable models [24] for causal and knowledge discovery [20,21].
In a seminal paper [1], Calude and Stay explored non-terminating programs from a probabilistic viewpoint.In further papers [2,3], they also proposed techniques that can be applied to estimate the probability that a program will not halt after certain execution time.
This paper builds on Calude's and other people's work in order to combine exhaustive exploration of programs with an estimation of their non-halting probability to produce approximations of Kolmogorov-Chaitin complexity.We do this using a high-level programming language.IMP [17] is used to investigate if a more expressive and efficient higher-level programming language can generate faster approximations to the algorithmic complexity of recursive functions, often of mathematical interest (e.g.short descriptions for arithmetic sequences).
Assuming optimality, we systematically generated all programs in the language IMP up to certain length and execute them in a controlled environment.We registered their output to find the smallest program to produce a given string, establishing its Kolmogorov-Chaitin complexity under this model.In order to deal with non-halting programs we estimated their halting probability and used it to execute those programs to ascertain whether they would stop within an acceptable threshold.

A Simple Imperative Programming Language
For our programming language we decided on IMP, a simple textbook example of an imperative programming language [17] that is Turing-complete.The syntax of IMP is given by the grammar in figure 1.One can assume IMP has a straightforward structural operational semantics [14].Registers of the type x[N ] hold natural numbers as possible values.But we are interested instead in single binary strings as the output of programs.We adopt the following convention to convert the values in the registers used by a program into binary strings by a two-step mapping: 1.For each register, we map its contents to a binary string according to the canonical ordering of strings (strings are first sorted by length and then ordered lexicographically).
2. We take the concatenation of all the registers' strings to be the program's output.In other words, a program's output is computed by: where f (x[n]) is the function that transforms x[n] into a binary string.
For step 1, we can use the very efficient algorithm 1.This algorithm can be derived from the simple observation that strings of 0s always correspond to numbers of the form 2 n − 1 and vice versa.This is because the first string of 0s corresponds to the number 1 and there are always 2 n strings of size n.So, if the string corresponding to the number 2 n − 1 is the string of n 0s, then the string of n + 1 0s must correspond to 2 n − 1 + 2 n = 2 n+1 − 1.
Input: A natural number n Output: The nth string in canonical order l ← log 2 (n + 1) ; c ← n − (2 l − 1); return c in l bits Algorithm 1: Algorithm for obtaining the n-th string in canonical order.
Step 2 involves concatenating the strings in all registers.Since registers are initialised as 0 and 0 is mapped to the empty string, the concatenation will be finite, and will only involve those registers which were modified by a given program.
For a small example of how we compute a program's output, consider the following program: At the end of the program's execution, register 0 contains 2, register 1 contains 1, register 2 contains 3, and the rest of the registers all contain 0.
To compute the program's output, each register is converted to a string.So register 0's 2 is converted to 1, register 1 is converted to 0 and register 2 is converted to 00.The rest of the registers are converted to .Therefore the program's output is: As Kolmogorov-Chaitin complexity is concerned with the smallest program producing a certain output, we need a length metric.We will use the terminal symbols of the language such as skip, while, etc., the decimal digits, and so on to measure the length of a program.
A quick glance at our grammar shows that the insertion of parentheses after each construct guarantees the unicity of the syntax trees.It also yields prefix-free programs, which will be useful in a future extension of this project.

Workarounds for the Halting problem
The Halting problem is the main reason why Kolmogorov-Chaitin's complexity is semi-computable.In order to know whether a certain program will output a given string at the end of its execution, we need to know whether the program will halt at all.But this is the paradoxical nature of an uncomputable problem [16].
In [15], the authors applied reduction techniques for Turing machines with 5 states and 2 symbols, including setting a time limit to run the machines, and were able to analyse the Kolmogorov-Chaitin complexity of a relatively large set of strings.
A small programming language, known as Brainfuck (BF), has also been used to show an empirical approximation [6] to Kolmogorov-Chaitin complexity, as well as in a first attempt to estimate a theoretical error measure for such approximations [11].
The main idea behind the theoretical model of [11] is to trade strict algorithmic solutions for approximate solutions.In other words, a value is considered to yield a solution within a particular range of confidence.The central idea is to pass from the worst case scenario of non-halting to an average case.As a program either halts or loops forever, there is no such thing as average halting in the strict sense, but hard instances of the halting problem are rarely encountered in practice, according to empirical observations provided by Köhler, Schindelhauer and Ziegler (although this depends on the particular encoding of the problem).
Here we are using a completely different method based on the theoretical results of Calude and Stay [1].Their results hold for any universal Turing machine and posit an a priori probability distribution over all running times.The probability space in this case is extended to both program lengths and runtimes, with time and programs uniformly distributed, in contrast to a previous approach by Chaitin based upon a probability space just over program length [5].
The central hypothesis is that long runtimes are in effect rare, and as such they tend to zero in all probability distributions.A method to decide when to stop a running program could be implemented by using an auxiliary Turing machine running a probabilistic anytime algorithm [9].Runtimes for halting programs are algorithmically non-random, which means they can be generated by a machine with a small input, in this case the auxiliary Turing machine.
Calude and Dumitrescu [2] have further proved that it is possible to apply the above method to approximate a runtime limit for the overwhelming majority of halting programs.This limit depends solely on the machine codification and it is independent of size and number of programs analysed.Their method produces two intrinsically connected viewpoints, one probabilistic and the other statistical [3].
Instead of proposing an a priori probability distribution, given that (a) long runtimes imply that the halting probability tends to zero and that (2) halting time values are algorithmically non-random, we are able to build a model for the halting probability based on running times of a finite set of programs executed on a universal Turing machine.In its turn, an upper bound for halting times can be estimated.
To calculate a limit for running times, the cumulative distribution function (CDF) and a (1 − )-quantile is used.Beyond the (1 − )-quantile value there is the upper -tail in which the halting probability is negligible, so these values can be used as a first qualitative measure for our approximation.For a program x running on the machine U , if the execution has not halted after the above-mentioned quantile value, we declare that the program will not halt, even if there are a few programs that still halt beyond this limit.

Decision error, precision and confidence parameters
Thus far we have a probabilistic device proposing that some programs under a certain optimal machine U may not halt according to a value ∈ (0, 1), the decision error.Achieving this result requires a probability distribution, indirectly based on the CDF and the quantile, and a random variable over time [3] where dom(U ) is the set of all the halting programs under U .Now the idea is to approximate the distribution by a long sequence of times.For this reason an Empirical Cumulative Distribution Function is defined as: with an N -dimensional time sampling space [3].Working with these values and using Hoeffding's inequality [10], two parameters are given: (a) the precision parameter λ, with 0 < λ < , and (b) the confidence parameter δ, with 1 − δ ∈ (0, 1).The first one is a bound for the approximation to the CDF by the ECDF; the second one is a confidence level, an index of how good our sampling is.
Having in hand three rational values that can be fine-tuned according to whatever levels of error, precision and confidence are desired in an approximation, a sampling size N can be calculated [3]: Sampling N independent halting programs in this manner and obtaining their running times are the first steps.Thereafter the times can be sorted in increasing order.The maximum time the sampled programs run until halting will be the threshold T. Any program that has not stopped by time t ≥ T will be considered to never halt, with an error equal to .
Two more approaches exist to provide a threshold time T. When it is too hard to sample exactly N (λ, δ) halting programs, it is easier to find an affordable sample of size Ñ .Two values can be given, , λ ∈ (0, 1) with λ < , and the third one can be obtained by: The second approach is as follows: if neither a dynamic nor an affordable sample size is possible, an upper bound for time T can be set as an initial parameter, adding and λ with similar properties as in the last case.A bigger sample of programs is used to obtain dom(U), that is, all the programs halting in a time t < T. Afterwards the size of the set N (T ) is calculated using the values , λ and Ñ = N (T ) as input for the second approach.Both ways may entail a decrease in the confidence level.The sampling size may not be directly dependent on , our main error value, but it is a bound for the precision parameter λ.Very optimistic values can be proposed for , λ and δ to obtain the size of the sampling set.The table 1 shows some values obtained with equation 3.

Generating programs sorted by length
Once we calculated a time limit for the execution of programs of length 9 or less, we executed all of them and observed their output (or gave up on them if the execution took longer than the estimated threshold for halting).But the systematic generation of all programs up to a certain length is not an easy task.In a future paper, we will cover all the technicalities involved, which have to do mainly with the combinatorial explosion of programs produced by the grammar.Here we simply present an overview.
The space of programs is modelled as an enumeration, mapping nonnegative integers called positions to abstract syntax trees corresponding to valid programs, such that every position is mapped to only one program and every valid program is mapped to only one position.
The enumeration used for this paper consists of all IMP programs sorted first by their length and then lexicographically.This enumeration is built in stages.
Starting from the context-free grammar of the language, a base enumeration is constructed where positions are mapped to programs based solely on the production rules of the grammar.This is done by breaking apart a position k as the syntax tree of a program is built from top to bottom.The position determines what kind of tree must be constructed and in turn the kind of tree determines how the position must be broken apart into the positions corresponding to each of its children.This enumeration satisfies the property that given programs p and q at positions i and j respectively, if q is a subprogram of p, then j < i.
For any reasonable measure for the length of programs, the length of a program is strictly greater than the length of any of its subprograms, however the aforementioned property isn't strong enough to guarantee that if q is smaller than p and q isn't a subprogram of p then j < i, which is one of main goals in constructing an enumeration sorted by length.
A simple metric to measure the length of programs is defined.The chosen metric counts the number of nodes in the syntax tree, and considers numbers with d digits as constructs of length d.
Using the base enumeration and the length metric, a counting function is defined which maps non-negative integers to the number of programs of length .The metric informs the way a particular length can be partitioned for different kinds of programs, while the base enumeration works as a blueprint for how to navigate the program space.Table 2 shows the relationship between program length and number of programs in the IMP language, along with the cumulative sum of these counts.

Length
Individual Accumulated  The base enumeration and the counting function are used to build enumerations of programs for a given length, called fixed-length enumerations.If c is the number of programs of length , the corresponding fixed-length enumeration, or -length enumeration, maps positions smaller than c to all programs of length .Fixed-length enumerations are finite, given that the number of programs of a fixed length is finite.
Finally, the canonically sorted enumeration is built by considering the fixed-length enumeration from smallest to largest length.This is achieved by keeping track of the program counts.Let k be any position, and let be the greatest length such that the cumulative sum of program counts c is less than k.Then the corresponding program is at position k − c of the -length enumeration.
The canonically sorted enumeration has the advantage of sorting programs in a predictable order.However, in contexts where the order is not as relevant, the base enumeration has a couple of advantages: • The algorithm to build a program for a given position is simpler and more efficient both in time and space.
• The current implementation includes the inverse process of calculating the position for a given program efficiently.
• Enumerations with different orderings or for subsets of the language can be easily derived.
[By the way, the techniques used to build the sorted enumeration can be applied to any other programming languages and metrics.The base enumeration is constructed in a declarative way by means of a combinator language based on Euclidean division and an n-tuple pairing function akin to Cantor's pairing function for pairs of natural numbers [4].The rest of the stages follow a manual process once a metric has been defined, and as long as the metric computes its value from a program's constituent parts and the mechanism for combining them, then this manual process can be adapted in a straightforward way.]

Sampling the program space
The sorted enumeration is used to navigate the space of programs.Starting from position zero, programs are generated and executed successively, with the guarantee that any programs generated afterwards will be of the same length or longer, which is convenient if the goal is to approximate their Kolmogorov-Chaitin complexity.
The results obtained from this process are collected and analysed after exploring all programs of each length.The space of programs of length at most is denoted by S .In this project we focus our attention on S 9 , which consists of |S 9 | = 123 089 621 programs.
Following the statistical approach of Calude et al for estimating running time, we used a sampling size of N = 13 815 511 corresponding to the last row of table 1.
The sample must consist of programs that halt.In order to achieve this, a random number generator is used to sample uniformly distributed integers in the interval [0, |S 9 |).Interpreted as positions in the sorted enumeration, these integers are effectively encodings of IMP programs in S 9 .After generating a program, it is executed with a large running time threshold T in order to determine if it can be considered part of the sample.
When choosing a threshold we must be careful to not set it too low, which would potentially exclude many halting programs from the sample.On the other hand, setting the threshold too high will affect the performance of the sampling process when executing non-halting programs.We decided to accept the performance loss by fixing T = 10 4 , as this value is quite large in comparison to the lengths of programs being executed.We believe it is a reasonable threshold to use during the sampling process.
For all sampled programs we recorded their position in the sorted enumeration, their length, their output strings, and their running time.Table 3 shows how the sample is distributed by program length and the percentage corresponding to the total number of programs.

Length
Sampled Percentage  The largest running time obtained from the sampled programs was 9 steps, which means that there were no sampled programs that halted between 10 and 10 4 steps, and indicates that the chosen running time threshold was high enough for our purposes.Figure 2 shows the distribution of the sample by the running time, where no programs ran in zero or one steps, and less than a hundred ran in 9 steps.Given the results described above and the statistical estimation of Calude et al, we should run all programs in the sample space using a running time threshold of T = 9, which significantly reduces the evaluation time of nonhalting programs, compared to the threshold T = 10 4 used for the programs in the sample.

Systematic execution of programs in S 9
In order to empirically verify the quality of the chosen running time threshold estimate, we generated and executed all programs in S 9 with threshold T to see if the statistically estimated T is good enough.
Table 4 classifies the number of programs by length into halting and nonhalting, along with the percentage of the total count.The results show a slight decreasing trend for the percentage of programs that halt as program length increases, and correspondingly the trend is inverted for non-halting programs.

Length
Halt (%) Non-halt (%)  The table shows a sudden drop to 66.7%, which interrupts the trend mentioned above.This is because there are only three programs of length 3, and one of them is the first non-halting program: (skip; skip), (while false do skip), and (while true do skip).
The largest running time obtained from the sampling space was in fact 9 steps, which confirms that the estimate T given by Calude's method would have given us the same results.
For the purpose of approximating the algorithmic complexity of a bit string, we are interested in finding non-trivial programs that produce an output longer that the programs' length.However, for the purpose of finding generating computable models, any program of any length will provide, and the number of programs found for the same object, is an indication of the data's nature and its algorithmic probability determining a likelihood as a generating mechanism.
Figure 4 shows the distribution of halting-programs by length and running time, and we expect non-trivial programs to appear in the upper-left triangle, where the running times are large relative to the length of the programs.Figure 5 shows the distribution of halting-programs by output length and program length.We also expect non-trivial programs to appear in the lowerright triangle of figure 5, where the lengths of the output strings are large relative to the length of the programs.In these figures, a darker shade means that there are more programs in the configuration-the intensity of the colouring is log-scaled in order to visualise clearly the distinction between zero programs and just a small number of programs.
In order to specify which programs are trivial in the context of Kolmogorov complexity, we use a standard process to calculate an upper bound for the complexity of a binary string b.First we compute the position n of b in the lexicographical enumeration of all binary strings.Then we build the program p as x[0] := n.Finally the length of p sets an upper bound on K IMP (b).As a base case, we include the program skip with length 1 as the upper bound for the empty binary string.
One future goal in exploring the program space is to improve upon this bound.Even though we explored around 123 million programs, and retrieved a million different output strings, the first smallest program found for every output string corresponded to the trivial program described above consistent with the theoretical expectation.
We found that the shortest programs produced a considerable number of strings.A table showing strings and a sample of the shortest programs producing them can be found at the the following URL: https://kolm-complexity.gitlab.io/approx-kolmogorov/approximatingkolmogorov.html

Program space beyond S 9
While the systematic exploration of programs up to length 9 has allowed us to find the simplest programs producing some small strings, we are still at the level of trivial programs in the sense that their length is on the same order of magnitude than the strings they output.We need to go farther than S 9 in order to find programs whose output is much bigger and therefore more interesting.In Figures 6, 7, 8, 9 we present four families of programs producing an output clearly bigger than the length of the programs: Examples of the output of these programs are shown in Tables 5, 6, 7 and 8.The changing parameter is n.As it can see, for each family there is a lower bound on the value of n when the output starts being clearly bigger than the length of the program itself.Given that the shortest examples that show interesting results live in S 25 , it will take an enormous amount of computational work to generate and execute the programs in this class.In order to get the results in the tables we did not follow the canonical enumeration but the base enumeration mentioned in section 3.3.Generating programs according to the base enumeration does not guarantee that we will find first the shortest program producing an output, but it can give us upper bounds for the Kolmogorov complexity that are definitely smaller than the trivial programs.As this enumeration is also bijective and exhaustive it will eventually produce the shortest program too.In the meantime it has already given us useful information about the Kolmogorov complexity of strings representing the functions 2 n , factorial, n n and n 2 n .The base enumeration can be regarded as a kind of look ahead exploration.

Conclusions and Future Work
We started this project as a proof of concept for approximating Kolmogorov complexity using a very different computational model from those used in previous work.We managed to overcome some very serious technical impediments to efficiently generating vast amounts of programs, and we are confident the project can be scaled up for longer programs.But before discussing how this may be done, we want to draw some conclusions about of considerably greater computational resources).Secondly, all the programs found in S 9 so far are still trivial, in the sense that their length is greater than that of the output strings.This was to be expected, as short strings can be described very succinctly by just presenting them.Yet this approach represents an opportunity to rank them even when short.Other approaches, including lossless statistical compression algorithms, present the same challenges and limitations, and are actually unable to rank short strings for more basic reasons.But there are indications that this could change for programs in S 11 .As this space is considerably larger than S 9 , we could not collect the outputs produced by all programs in S 11 that will be calculated in the future.Nevertheless, Figure 10 shows the distribution of S 11 by program length and running time.In contrast to figure 4, we are starting to get programs in the upper-left region above the identity line.Specifically, there are some programs of length 10 with a running time of 11 steps, and programs of length 11 with a running time of 13 steps, which means (quite significantly) that we are starting to see programs whose execution times exceed their length, and we think this region will grow faster in larger spaces.Longer execution times are needed to produce larger outputs.
Thirdly, using a different enumeration of programs we managed to explore much bigger spaces than S 9 showing how the method can generate nontrivial programs and can find highly recursive ones quickly without many resources even at the beginning of the enumeration by choosing a high-level computer language.This is key for the causal discovery program in [20,24] with applications to, for example, areas of Artificial Intelligence [23].
On the other hand, it is naive to think that such a vast and ever growing enterprise as calculating Kolmogorov complexity can be undertaken by a small team with limited resources.Currently, our workflow looks like this: we read a list of programs, run them, and save the outputs in a text file.After receiving the outputs, we check that the file is accurate through two mechanisms: an integrity check using hashes and a correctness check through the execution of a random sample of programs.
This process can easily be made to work within a collaborative framework.Among others, one option is to use the framework of the Automacoin [25] cryptocurrency tocken.A user of Automacoin will get a reward for running millions of programs and submitting their output.This method of collaboration is sustainable and useful for both parties.
Collaborative frameworks need portable and efficient ways of setting and running partial tasks for collaborators.A portable, light-weight interpreter for IMP will be part of this effort.We are working on this and we already have a nice prototype implemented in Haskell [26].

Figure 2 :
Figure 2: Running time distribution across the sample.

Figure 3
Figure 3 shows how the lengths of the outputs in the sample are distributed.Almost a third of the output strings obtained were the empty string, and the largest outputs consist of 19 bits.

Figure 3 :
Figure 3: Output length distribution across the sample.

Figure 4 :
Figure 4: Distribution of S 9 program length and running time.

Figure 5 :
Figure 5: Distribution of sample space by output length and program length.

Figure 6 :
Figure 6: Family of programs that calculate 2 n .

Figure 7 :
Figure 7: Family of programs that calculate n!.

Figure 8 :
Figure 8: Family of programs that calculate n n .

Figure 9 :
Figure 9: Family of programs that calculate n 2 n .

Figure 10 :
Figure 10: Distribution of S 11 by program length and running time.

Table 2 :
Number of IMP programs by length.

Table 3 :
Number of sampled programs by length.

Table 4 :
Halting and non-halting programs by length.

Table 5 :
IMP programs computing 2 n

Table 6 :
IMP programs computing n!

Table 7 :
IMP programs computing n n the already explored space of programs.First of all, we saw that Calude's method for estimating probable halting times can be applied to our chosen computational model without much trouble.Better still, estimating halting times for bigger program spaces is feasible (we have done this for programs of length 11 already, without need

Table 8 :
IMP programs computing n 2 n