A Graph Theoretic Approach to Randomness Test Based on the Overlapping Blocks

Cryptographic parameters, such as secret keys, should be chosen randomly, yet it should not be too difficult to reproduce them when necessary. For this reason, pseudorandom bit (or number) generators take the role of true random generators. Although the outputs of pseudorandom generators are produced by a deterministic process, they should look random, that is, they should not be distinguishable from true random sequences. In other words, they should not follow any pattern. In this paper we propose a new approach, based on graph theory, to determine the expected value of the index at which a fixed pattern starts to appear in a random sequence for the first time. Using the proposed method, a recursion can be computed for the number of paths of length n that start from a pattern and never come back to that pattern. By means of these recursions, we obtain the probabilities for the indexes at which a fixed pattern appears in the sequence for the first time. By comparing these expected values with the observed values, a randomness test can be defined. In this work, patterns are traced through the sequence in an overlapping manner.


Introduction
The concept of random sequences is vital in cryptography and in many other fields, ranging from statistics to computer simulations. In cryptography, random sequences are needed not only in symmetric key encryption and key generation but also for the generation of primes for RSA encryption, initialization vectors, salts in hash functions and the like.
The sources of True Random Number Generators (TRNG) are usually complicated physical events such as lightning or atmospheric or thermal noise. Reproducing their outputs is therefore very difficult, if not impossible, which makes them impractical in cryptographic applications. The solution to this problem is Pseudo Random Number Generators (PRNG). A PRNG produces random looking sequences from short random seeds by means of deterministic algorithms. Sequences produced by PRNGs must behave like those obtained from TRNGs, that is, they should not contain any recognizable pattern or order.
In order to be used safely in cryptography, PRNGs and their outputs must be tested for randomness from many different aspects. A set of statistical randomness tests, called a test suite, can be used to make sure that there is no weakness in the randomness of the sequence that will be used. Many documents outline how such statistical randomness tests can be designed; [1] gives detailed information on this. Many tests are defined in the literature, among them [2], [3], [4], [5], [6], [7], [8], [9]. Many test suites are also available in the literature [10], [11], [12], [13], [14], [15].
The number of rounds at which a block cipher achieves randomness is one of the most important design criteria. Soto et al. [16] used this idea and analyzed the AES competition finalist algorithms from this point of view, using the NIST test suite [15]. In the NIST test suite, there are two randomness tests that consider the number of occurrences of a predefined template in a sequence, namely the overlapping template matching test and the non-overlapping template matching test. The computations given there are valid only for the template B = 111111111. In fact, the probabilities change depending on the period of the template. In [17], all possible templates are classified according to their period, and for each template the exact values of the probabilities are evaluated using generating functions. Finally, a new statistical randomness test is proposed.
In this work, we propose a new approach to calculate the probabilities for overlapping templates using a graph theoretical method. Using the obtained values a randomness test can be defined following the steps described in [1].
The organization of the paper is as follows. In Section 2, we state the problem, define the reversed graph with its transition matrix, and state and prove two theorems. In Section 3, we give probability values for the pattern 010, and in Section 4, we give recursions with initial values for the other patterns of length 3. In Section 5, we list characteristic polynomials for all patterns of length 4, from which recursions and then probability values can be derived. In Section 6, we derive an explicit formula for the generating function of the probability sequence of the pattern 010 and, by computing its derivative at 1, we obtain the expected value of the first occurrence of the pattern in a random sequence. We finish the paper with a conclusion.

Table 1. A binary sequence example
index $j$   1  2  3  4  5  6  7  8  9  10  ...
$r_j$       1  0  0  1  1  0  0  1  0  1   ...

Overlapping Blocks
Let $\{r_i\} = r_1, r_2, r_3, \dots$ be a binary sequence and let $P = b_1 b_2 \cdots b_l$ be a fixed pattern. In this paper, formulas for the following are given:
• For each $k$, the probability $P_k$ that the first occurrence of the pattern $P$ is at position $k$ is calculated.
• Let $j$ be the first observed position of $P$ in the sequence. The expected value of $j$ is calculated.
For example, among the binary patterns of length three, take $P = 010$, which corresponds to the integer 2, and consider the sequence $\{r_i\}$ of the binary sequence example given in Table 1.
The first occurrence of the pattern $P = 010$ is at the seventh position. Using the integers $a_j = (r_j r_{j+1} r_{j+2})_2 \in \{0, 1, 2, 3, 4, 5, 6, 7\}$, one can express the same binary sequence $\{r_j\}_{j=1}^{n}$ in the form $\{a_j\}_{j=1}^{n-2}$, as in the corresponding integer sequence example given in Table 2. From this equivalent point of view, the problem we are interested in is to determine the index at which a pattern, $2 = (010)_2$ as an example, is expected to be observed for the first time.
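The passage from the bit sequence to the integer sequence can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `to_blocks` and `first_occurrence` are hypothetical, and positions are counted from 1 as in the text.

```python
def to_blocks(bits, length=3):
    """Map a bit list r_1..r_n to the overlapping-block integers a_1..a_{n-length+1}."""
    blocks = []
    for j in range(len(bits) - length + 1):
        value = 0
        for bit in bits[j:j + length]:
            value = 2 * value + bit  # read the window as a binary number
        blocks.append(value)
    return blocks


def first_occurrence(bits, pattern, length=3):
    """Return the 1-based index of the first block equal to `pattern`, or None."""
    for j, value in enumerate(to_blocks(bits, length), start=1):
        if value == pattern:
            return j
    return None


# The ten bits of the binary sequence example (Table 1).
r = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1]
print(to_blocks(r))            # block integers a_1..a_8
print(first_occurrence(r, 2))  # index of the first block equal to 2 = (010)_2
```

For the ten bits above, the first block equal to 2 is found at index 7, in agreement with the text.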
This way, any binary sequence $\{r_i\}_{i=1}^{l}$ can be identified with the corresponding integer sequence $\{a_i\}_{i=1}^{l-2}$. Notice that even if the sequence $\{r_i\}$ is a random binary sequence on the set $\mathbb{Z}_2$, the corresponding sequence $\{a_i\}$ will not be a random sequence on the set $\mathbb{Z}_8$. If $a_i = 0$, as an example, $a_{i+1}$ can have only two values, namely 0 or 1. In fact, there are only two possible values for $a_{i+1}$, depending on $a_i$ and $r_{i+3}$. More clearly, modulo 8, $a_i$ is either $2a_{i-1}$ or $2a_{i-1} + 1$, depending on whether $r_{i+2}$ is 0 or 1.
Consider the directed graph, called the adjacent graph, with eight vertices corresponding to the binary patterns $a_j = (r_j, r_{j+1}, r_{j+2})$, given below. Each vertex of this directed graph has two successor vertices and two predecessor vertices.
Consider a path on this graph that starts from a vertex and terminates as soon as it reaches vertex 2. Any such path corresponds to a binary sequence whose first three terms are determined by the initial vertex. Table 3 lists three such binary sequences and their corresponding paths.
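The adjacent graph can be written down directly from the rule that, modulo 8, a vertex $v$ is followed by $2v$ or $2v + 1$. The sketch below is an illustration under that rule only; the function names `successors`, `predecessors` and `path_until` are hypothetical.

```python
def successors(v, length=3):
    """Vertices reachable from v in one step: shift left and append bit 0 or 1."""
    size = 2 ** length
    return [(2 * v) % size, (2 * v + 1) % size]


def predecessors(v, length=3):
    """Vertices with an edge into v: drop the last bit and prepend bit 0 or 1."""
    half = 2 ** (length - 1)
    return [v // 2, v // 2 + half]


def path_until(bits, target=2, length=3):
    """Follow the path of overlapping blocks and stop once `target` is reached."""
    path = []
    for j in range(len(bits) - length + 1):
        vertex = int("".join(map(str, bits[j:j + length])), 2)
        path.append(vertex)
        if vertex == target:
            break
    return path


for v in range(8):
    print(v, successors(v), predecessors(v))
print(path_until([0, 1, 1, 1, 0, 1, 0]))  # path of the sequence 0111010
```

Every vertex gets exactly two successors and two predecessors, and the sequence 0111010 is traced as the path 3, 7, 6, 5, 2.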
Using this graph theoretic terminology, one can list all of the paths that start from the vertex 0, from the vertex 1, ..., and finally from the vertex 7 and that terminate as soon as they reach vertex 2 and, by listing their lengths, compute the probability that this length is $k$. An easier method is to make use of the reversed graph, obtained by reversing the orientations of the edges. Considering all paths of length $k$ that start from the vertex 2 and do not come back to it, the desired probability can be calculated. For this purpose, both of the edges directed to the vertex 2 are deleted, which gives the reversed graph with edges to vertex 2 deleted. Table 4 lists three such binary sequences and their corresponding paths.

Table 4. Examples of paths starting from 2 and the corresponding sequences
Path 1                 Path 2 (reversed)      Sequence
2                      2                      010
3, 7, 6, 5, 2          2, 5, 6, 7, 3          0111010
3, 7, 7, 6, 4, 1, 2    2, 1, 4, 6, 7, 7, 3    011110010

Let $l_k$ denote the number of all paths of length $k$ (counted after the initial 2) that start from the vertex 2 of the reversed graph with edges to 2 deleted. Note that the initial pattern 2 is not counted when the length is determined. For $k = 1$ and $k = 2$ the complete lists of paths are as follows:
• There are two paths of length $k = 1$, namely 2, 1 and 2, 5, and hence $l_1 = 2$.
• There are three paths of length $k = 2$, namely 2, 1, 0; 2, 1, 4 and 2, 5, 6, and hence $l_2 = 3$.
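The short lists above can be reproduced by a direct enumeration on the reversed graph. The following is a minimal sketch, assuming that the out-neighbours of a vertex $v$ in the reversed graph are $\lfloor v/2 \rfloor$ and $\lfloor v/2 \rfloor + 4$; the names `reversed_neighbours` and `paths_from` are hypothetical.

```python
def reversed_neighbours(v, target=2, length=3):
    """Out-neighbours of v in the reversed graph, with edges into `target` removed."""
    half = 2 ** (length - 1)
    return [u for u in (v // 2, v // 2 + half) if u != target]


def paths_from(target=2, k=1, length=3):
    """All paths with k edges that start at `target` and never return to it."""
    paths = [[target]]
    for _ in range(k):
        paths = [p + [u] for p in paths
                 for u in reversed_neighbours(p[-1], target, length)]
    return paths


for k in (1, 2, 3):
    found = paths_from(2, k)
    print(k, len(found), found)
```

Explicit enumeration grows quickly with $k$, so it is useful only as a check on small cases.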

Definition 1 The transition matrix $A$ of a directed graph is defined as the square matrix with entries $A_{ij}$ equal to the number of edges in the graph from the vertex $i$ to the vertex $j$.
Let $T$ denote the matrix obtained from the transition matrix $R$ of the reversed graph by deleting the edges between 2 and its predecessor vertices.
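Notice that, since both of the edges to the vertex 2 are deleted, the third column of $T$ is all zero, that is, there is no edge pointing to 2. The third row, namely $0\ 1\ 0\ 0\ 0\ 1\ 0\ 0$, indicates that the only edges from 2 are to 1 and to 5. Moreover, the rank of $T$, which can be computed as the number of linearly independent columns of $T$, is 4. Consider $T$, $T^2$ and $T^3$, with rows and columns indexed by the vertices $0, 1, \dots, 7$:
$$T = \begin{pmatrix}
1&0&0&0&1&0&0&0\\
1&0&0&0&1&0&0&0\\
0&1&0&0&0&1&0&0\\
0&1&0&0&0&1&0&0\\
0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&1&0\\
0&0&0&1&0&0&0&1\\
0&0&0&1&0&0&0&1
\end{pmatrix}, \quad
T^2 = \begin{pmatrix}
1&0&0&0&1&0&1&0\\
1&0&0&0&1&0&1&0\\
1&0&0&0&1&0&1&0\\
1&0&0&0&1&0&1&0\\
0&0&0&1&0&0&0&1\\
0&0&0&1&0&0&0&1\\
0&1&0&1&0&1&0&1\\
0&1&0&1&0&1&0&1
\end{pmatrix}, \quad
T^3 = \begin{pmatrix}
1&0&0&1&1&0&1&1\\
1&0&0&1&1&0&1&1\\
1&0&0&1&1&0&1&1\\
1&0&0&1&1&0&1&1\\
0&1&0&1&0&1&0&1\\
0&1&0&1&0&1&0&1\\
1&1&0&1&1&1&1&1\\
1&1&0&1&1&1&1&1
\end{pmatrix}.$$
Each entry 1 in the third rows of these three matrices, which correspond to the vertex 2, is in one-to-one correspondence with a path having the vertex 2 as its starting point. More clearly,
• The two paths 2, 1 and 2, 5 of length 1 are represented by the two 1's in the third row of the matrix $T$.
• The three paths 2, 1, 0; 2, 1, 4 and 2, 5, 6 of length 2 are represented by the three 1's in the third row of the matrix $T^2$.
• The five paths of length 3 are represented by the five 1's in the third row of the matrix $T^3$.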
We claim that this is not a coincidence. In fact,
Theorem 2 Consider the reversed graph with edges to 2 deleted. Then $(T^k)_{ij} = T^k_{ij}$, that is, the $(i, j)$th entry of the matrix $T^k$, is equal to the number of all paths of length $k$ in this graph from the vertex $i$ to the vertex $j$.
Proof: We will use mathematical induction. For $k = 1$, the statement is true by the definition of the matrix $T$. Assume the statement is true for $k$, and consider $T^{k+1}$. Recall the matrix multiplication: to obtain $T^{k+1}_{ij}$ we multiply the $i$th row of $T$ with the $j$th column of $T^k$. That is,
$$T^{k+1}_{ij} = \sum_{r} T_{ir} \, T^{k}_{rj}.$$
Here $T_{ir}$ is the number of paths of length 1 from vertex $i$ to vertex $r$, and $T^k_{rj}$ is, by the induction hypothesis, the number of paths of length $k$ from vertex $r$ to vertex $j$. Hence the product $T_{ir} \cdot T^k_{rj}$, summed over the index $r$, gives the total number of paths of length $k + 1$ from vertex $i$ to vertex $j$.
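The statement of Theorem 2 can be checked numerically for small $k$. The sketch below builds $T$ from the rule that, in the reversed graph, vertex $i$ points to $\lfloor i/2 \rfloor$ and $\lfloor i/2 \rfloor + 4$, with edges into the target removed; `build_T` and `matmul` are hypothetical helper names and plain nested lists stand in for matrices.

```python
def build_T(target=2, length=3):
    """Transition matrix of the reversed graph with edges into `target` deleted."""
    size, half = 2 ** length, 2 ** (length - 1)
    T = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in (i // 2, i // 2 + half):
            if j != target:
                T[i][j] = 1
    return T


def matmul(A, B):
    """Plain integer matrix product of two square lists of lists."""
    n = len(A)
    return [[sum(A[i][r] * B[r][j] for r in range(n)) for j in range(n)]
            for i in range(n)]


T = build_T()
power = T
for k in (1, 2, 3):
    print(k, power[2], sum(power[2]))  # third row of T^k and its sum
    power = matmul(power, T)
```

The row sums printed for $k = 1, 2, 3$ are the path counts $l_1$, $l_2$, $l_3$ of the text.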
Recall that the aim of this paper is to count the number of all paths of length $k$ starting from a fixed vertex, for example from vertex 2; in other words, to find the sum of all entries in the 3rd row of $T^k$. For this reason we want to find $T^k$, or the sum of all entries in one of its rows, in a practical way, for example in a recursive manner. We first illustrate, by an example, this idea of computing the sum of all elements in a row of a matrix using recursion. First of all, we need to obtain a polynomial satisfied by the matrix.
Recall that the eigenvalues of a square $n \times n$ matrix $A$ are the roots of the characteristic polynomial of the matrix, defined by $\det(A - \lambda I)$, where $I$ denotes the $n \times n$ identity matrix. The trace of a square matrix is defined as the sum of its diagonal elements. Equivalently, it is equal to the sum of the eigenvalues of the matrix. Similarly, the determinant is equal to the product of the eigenvalues.
Example 1 Consider the two-bit patterns $0 = 00$, $1 = 01$, $2 = 10$ and $3 = 11$. The two possible successors of each of these four patterns are
$$0 \to 0, 1 \qquad 1 \to 2, 3 \qquad 2 \to 0, 1 \qquad 3 \to 2, 3,$$
and hence the data for the reversed graph is
$$0 \to 0, 2 \qquad 1 \to 0, 2 \qquad 2 \to 1, 3 \qquad 3 \to 1, 3.$$
In other words, the transition matrix $T$ of the corresponding reversed graph is
$$T = \begin{pmatrix} 1&0&1&0\\ 1&0&1&0\\ 0&1&0&1\\ 0&1&0&1 \end{pmatrix}.$$
Notice that this matrix has trace equal to 2 and determinant equal to 0. We can compute powers of this matrix easily and get
$$T^2 = \begin{pmatrix} 1&1&1&1\\ 1&1&1&1\\ 1&1&1&1\\ 1&1&1&1 \end{pmatrix}, \qquad T^3 = 2T^2,$$
and hence, by induction, $T^k = 2T^{k-1}$ for all larger $k$, in particular for $k \geq 4$. From this observation we see that the matrix $T$ satisfies the equation
$$T^k - 2T^{k-1} = 0.$$
Therefore, for $k \geq 4$, each power $T^k$ is obtained from the previous one through the linear factor $(x - 2)$, that is, through multiplication by 2.
In the general case, by the Cayley-Hamilton theorem, a square matrix satisfies its own characteristic polynomial; for a $2 \times 2$ matrix $A$ this reads
$$A^2 - \mathrm{trace}(A)\,A + \det(A)\,I = 0.$$
This polynomial equation defines a recursion that can be used to compute powers of the matrix easily: each power is a linear combination of the two preceding powers of the same matrix. Moreover, the sum of the entries in the $i$th row of the $k$th power can be computed easily by means of this recursion, and hence the total number of paths from vertex $i$ can be obtained. As an example, if $A$ is a $2 \times 2$ matrix, then using the notation
$$A^n = \begin{pmatrix} a_n & b_n \\ c_n & d_n \end{pmatrix}$$
and the recursion $x_{n+1} = \mathrm{trace}(A)\,x_n - \det(A)\,x_{n-1}$, the following equations can be written:
$$a_{n+1} = \mathrm{trace}(A)\,a_n - \det(A)\,a_{n-1}, \qquad b_{n+1} = \mathrm{trace}(A)\,b_n - \det(A)\,b_{n-1},$$
$$c_{n+1} = \mathrm{trace}(A)\,c_n - \det(A)\,c_{n-1}, \qquad d_{n+1} = \mathrm{trace}(A)\,d_n - \det(A)\,d_{n-1}.$$
Notice that the same linear recurrence relation is also satisfied by the sum of the entries in any fixed row or fixed column of the matrix. For example, if $s_n = a_n + b_n$ denotes the sum of the first row of $A^n$, then $s_{n+1} = \mathrm{trace}(A)\,s_n - \det(A)\,s_{n-1}$. This way a recursion to compute the sum of all entries in a fixed row of $A^k$ is obtained. When $A$ is the transition matrix of the reversed graph, this sum is equal to the total number of paths of length $k$ starting from the vertex that corresponds to the row. The degree of the recursion is the same as the degree of the characteristic equation of $A$.
Now, turning back to the $8 \times 8$ matrix $T$ with characteristic polynomial $x^5(x^3 - 2x^2 + x - 1)$, one can write $l_{n+3} = 2l_{n+2} - l_{n+1} + l_n$ as a recursion satisfied by $l_n$, and hence the following theorem can be stated:
Theorem 3 Let $l_n$ denote the number of all paths of length $n$ starting from the vertex 2 and never coming back to 2. Then $l_n$ satisfies the recursion relation
$$l_{n+3} = 2l_{n+2} - l_{n+1} + l_n.$$
By convention $l_0 = 1$, and by simple counting $l_1 = 2$, $l_2 = 3$ and $l_3 = 5$ (and hence $l_4 = 9$, $l_5 = 16$, and so on).
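Both recursions of this part can be tried out numerically. The sketch below first compares the first-row sums of the powers of a $2 \times 2$ matrix with the values given by $x_{n+1} = \mathrm{trace}\cdot x_n - \det\cdot x_{n-1}$, and then lists $l_n$ from the recursion of Theorem 3. The matrix `A` is an arbitrary example chosen here for illustration and is not taken from the paper; all function names are hypothetical.

```python
def matmul2(A, B):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(A[i][r] * B[r][j] for r in range(2)) for j in range(2)]
            for i in range(2)]


def row_sums_of_powers(A, n_max):
    """First-row sums of A^0, A^1, ..., A^n_max by explicit multiplication."""
    power = [[1, 0], [0, 1]]
    sums = []
    for _ in range(n_max + 1):
        sums.append(power[0][0] + power[0][1])
        power = matmul2(power, A)
    return sums


def recursion_from_trace_det(A, s0, s1, n_max):
    """The same sums from x_{n+1} = trace(A) x_n - det(A) x_{n-1}."""
    trace = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    s = [s0, s1]
    while len(s) <= n_max:
        s.append(trace * s[-1] - det * s[-2])
    return s


def l_sequence(n_max):
    """l_0..l_n_max from l_{n+3} = 2 l_{n+2} - l_{n+1} + l_n with l_0, l_1, l_2 = 1, 2, 3."""
    l = [1, 2, 3]
    while len(l) <= n_max:
        l.append(2 * l[-1] - l[-2] + l[-3])
    return l[:n_max + 1]


A = [[1, 2], [3, 4]]  # arbitrary example matrix, not taken from the paper
direct = row_sums_of_powers(A, 8)
print(direct == recursion_from_trace_det(A, direct[0], direct[1], 8))
print(l_sequence(8))
```

Only two starting values are needed for the $2 \times 2$ case, and three for $l_n$, which is what makes the recursive route cheaper than computing matrix powers.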

Probability Computations of Overlapping Blocks
Recall that, on the reversed graph, the total number of paths of length $k$ that start with 2 and in which the pattern 2 never appears again is denoted by $l_k$, and that $l_k$ satisfies a certain recursion. In other words, $l_k$ is the cardinality of the set
$$\{(2, v_1, \dots, v_k) : (2, v_1, \dots, v_k) \text{ is a path on the reversed graph and } v_i \neq 2 \text{ for } 1 \leq i \leq k\}.$$
This means that $l_k$ of all $2^{k+3}$ possible sequences of bit length $k + 3$ satisfy the condition of starting with the pattern 2, continuing for $k$ further steps, and not containing the pattern 2 again. This means that
• For sequences of bit length 4, $(r_0 r_1 r_2 r_3)$: there are 16 of them and each contains 2 patterns of length 3, namely $r_0 r_1 r_2$ and $r_1 r_2 r_3$, so the path consists of 2 patterns. Exactly $l_1 = 2$ of them start with (010) and do not reach $(010)_2 = 2$ again. They are 0100 and 0101. Hence the probability is $P_4 = \frac{l_1}{2^4} = \frac{2}{16} = \frac{1}{8}$.
• For sequences of bit length 5, $(r_0 r_1 r_2 r_3 r_4)$: there are 32 of them and each contains 3 patterns of length 3, namely $r_0 r_1 r_2$, $r_1 r_2 r_3$ and $r_2 r_3 r_4$, so the path consists of 3 patterns. Exactly $l_2 = 3$ of them start with (010) and do not reach 2 again. These are 01000, 01001 and 01011. Hence $P_5 = \frac{l_2}{2^5} = \frac{3}{32}$.
• Similarly, for sequences of length 6 the probability is $P_6 = \frac{l_3}{2^6} = \frac{5}{64}$, for sequences of length 7 it is $P_7 = \frac{l_4}{2^7} = \frac{9}{128}$, for sequences of length 8 it is $P_8 = \frac{l_5}{2^8} = \frac{16}{256}$, and so on.
Consider a random binary sequence of length $i + 3$ and let $P_{i+3}$ denote the probability that this sequence starts with 2 and never comes back to 2. We have seen that
$$P_{i+3} = \frac{l_i}{2^{i+3}}.$$
Defining $l_0 = 1$ by convention gives $P_3 = \frac{l_0}{2^3} = \frac{1}{8}$.
Notice that, in order to be able to use the bin values in a $\chi^2$ goodness of fit test, the number of overlapping blocks of length three should be at least $5 \times \frac{1}{0.016944} \approx 295$, and hence the length of the sequence should be at least 297.
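The small cases listed above can be confirmed by brute force over all bit strings. This is a sketch for checking purposes only; the helper name `count_start_only` is hypothetical, and the value 0.016944 in the last line is simply the number quoted in the text.

```python
from fractions import Fraction
from itertools import product


def count_start_only(n_bits, pattern=(0, 1, 0)):
    """Count n_bits-long sequences that begin with `pattern` and never show it again."""
    m = len(pattern)
    count = 0
    for tail in product((0, 1), repeat=n_bits - m):
        seq = pattern + tail
        # look for a second occurrence at any later starting position
        if all(seq[j:j + m] != pattern for j in range(1, n_bits - m + 1)):
            count += 1
    return count


for n_bits in range(3, 9):
    l = count_start_only(n_bits)
    print(n_bits, l, Fraction(l, 2 ** n_bits))

# Number of blocks for which a bin of probability 0.016944 has expected count 5.
print(5 / 0.016944)
```

The counts obtained for bit lengths 3 to 8 are $l_0, \dots, l_5$, so the fractions printed are the probabilities $P_3, \dots, P_8$ in the indexing of this section.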

Other patterns of length 3
Notice that all definitions, explanations and theorems up to this point are valid for the pattern $2 = (010)$. Now we will consider the other patterns of length $l = 3$. There are eight patterns of length 3 in total: 0, 1, 2, 3, 4, 5, 6, 7. Let us denote by $T^{(i)}$ the transition matrix of the reversed graph with the edges between $i$ and its predecessor vertices deleted. Then the matrix $T$ above is $T^{(2)}$ in this new notation.
As for the pattern 010, for each of the other patterns of length 3, using these recursions obtained above, corresponding randomness tests can be defined.
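The path counts for every pattern of length 3 can be generated by the same procedure, without forming the matrices $T^{(i)}$ explicitly. The sketch below propagates path counts vertex by vertex; `path_counts` is a hypothetical helper name, and the neighbour rule $v \to \lfloor v/2 \rfloor, \lfloor v/2 \rfloor + 4$ for the reversed graph is the one implied by the successor rule of the adjacent graph.

```python
def path_counts(target, n_max, length=3):
    """l_0..l_n_max for `target`: paths on the reversed graph that never return to it."""
    size, half = 2 ** length, 2 ** (length - 1)
    counts = [0] * size
    counts[target] = 1               # the single path of length 0
    totals = [1]
    for _ in range(n_max):
        nxt = [0] * size
        for v, c in enumerate(counts):
            for u in (v // 2, v // 2 + half):
                if u != target:      # edges into the target are deleted, as in T^(target)
                    nxt[u] += c
        counts = nxt
        totals.append(sum(counts))
    return totals


for pattern in range(8):
    print(format(pattern, "03b"), path_counts(pattern, 6))
```

Patterns that are bitwise complements of each other, such as 010 and 101, give the same counts, so only some of the eight sequences are distinct.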

Blocks of length bigger than 3
As mentioned above, the same arguments work for fixed patterns of length greater than 3. As an example, for the pattern 0000 of length 4, the transition matrix and its characteristic polynomial and, similarly, the characteristic polynomials of all other patterns of length 4 are given below. Using these polynomials, one can easily obtain recursions as above and hence compute the probabilities and expected values for each of these patterns.
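For readers who wish to recompute such characteristic polynomials, the following sketch may help. It uses the Faddeev-LeVerrier algorithm with exact integer arithmetic, which is a choice made here for illustration and not necessarily the method used by the authors; `build_T`, `matmul` and `char_poly` are hypothetical helper names.

```python
def build_T(target, length):
    """Transition matrix of the reversed graph with edges into `target` deleted."""
    size, half = 2 ** length, 2 ** (length - 1)
    T = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in (i // 2, i // 2 + half):
            if j != target:
                T[i][j] = 1
    return T


def matmul(A, B):
    """Plain integer matrix product of two square lists of lists."""
    n = len(A)
    return [[sum(A[i][r] * B[r][j] for r in range(n)) for j in range(n)]
            for i in range(n)]


def char_poly(A):
    """Coefficients of det(xI - A), highest power first (Faddeev-LeVerrier)."""
    n = len(A)
    M = [[0] * n for _ in range(n)]
    coeffs = [1]
    for k in range(1, n + 1):
        AM = matmul(A, M)
        # M_k = A * M_{k-1} + c * I, with c the most recently found coefficient
        M = [[AM[i][j] + (coeffs[-1] if i == j else 0) for j in range(n)]
             for i in range(n)]
        AMk = matmul(A, M)
        trace = sum(AMk[i][i] for i in range(n))
        coeffs.append(-trace // k)   # exact: the coefficients are integers
    return coeffs


print(char_poly(build_T(2, 3)))   # pattern 010 of length 3
print(char_poly(build_T(0, 4)))   # pattern 0000 of length 4
```

For the pattern 010 the coefficient list corresponds to $x^5(x^3 - 2x^2 + x - 1)$, the polynomial quoted earlier for the $8 \times 8$ matrix $T$; reading off the nonzero coefficients of a length 4 pattern gives its recursion in the same way.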

Expected Values
Here we will derive the expected value formula for the pattern 010. Using the recursions obtained above, it is straightforward to derive the corresponding formulas for the other patterns.
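Recall that if a random binary sequence generator stops the generation as soon as $2 = (010)$ appears, and if $P_i$ denotes the probability that the pattern first appears at index $i$ (so that the generated sequence has length $i + 2$), we have
$$P_i = \frac{l_{i-1}}{2^{i+2}}, \qquad i \geq 1.$$
Consider the generating function
$$F(x) = \sum_{i \geq 0} P_i x^i$$
of the sequence $\{P_i\}$. Then $F(1)$ is the sum of all probabilities and hence $F(1) = 1$. Moreover,
$$F'(1) = \sum_{i \geq 1} i P_i$$
is the expected value of the index of the first occurrence.
Now recall that for the pattern 010, the sequence $\{l_i\}$ satisfies the recursion $l_{n+3} = 2l_{n+2} - l_{n+1} + l_n$, where $l_0 = 1$, $l_1 = 2$, $l_2 = 3$, $l_3 = 5$, .... Moreover $P_0 = 0$, $P_1 = \frac{l_0}{2^3}$, $P_2 = \frac{l_1}{2^4}$, .... Thus, using this recursion, we can write
$$F(x) = P_1 x + P_2 x^2 + P_3 x^3 + \cdots + P_n x^n + \cdots = \frac{l_0}{2^3}x + \frac{l_1}{2^4}x^2 + \frac{l_2}{2^5}x^3 + \frac{2l_2 - l_1 + l_0}{2^6}x^4 + \frac{2l_3 - l_2 + l_1}{2^7}x^5 + \cdots$$
Hence, $F(x)$ can be expressed as a rational function:
$$F(x) = \frac{l_0}{2^3}x + \frac{l_1}{2^4}x^2 + \frac{l_2}{2^5}x^3 + x\left(F(x) - \frac{l_0}{2^3}x - \frac{l_1}{2^4}x^2\right) - \frac{x^2}{4}\left(F(x) - \frac{l_0}{2^3}x\right) + \frac{x^3}{8}F(x).$$
Substituting $l_0 = 1$, $l_1 = 2$, $l_2 = 3$, we obtain
$$F(x)\left(1 - x + \frac{x^2}{4} - \frac{x^3}{8}\right) = \frac{x}{8} + \frac{x^2}{8} + \frac{3x^3}{32} - \frac{x^2}{8} - \frac{x^3}{8} + \frac{x^3}{32}.$$
Rearrangement of the terms leads to
$$F(x)\left(1 - x + \frac{x^2}{4} - \frac{x^3}{8}\right) = \frac{x}{8}$$
and, finally, we obtain
$$F(x) = \frac{x}{8 - 8x + 2x^2 - x^3}.$$
Taking the derivative, we obtain
$$F'(x) = \frac{8 - 2x^2 + 2x^3}{(8 - 8x + 2x^2 - x^3)^2}.$$
Thus the expected value of the index at which the pattern 2 appears for the first time is
$$F'(1) = 8.$$
Using this expected value, a statistical test can be defined to judge whether the first appearance of the pattern, say 010, in the sequence under consideration is too late, too early, or as expected.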
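The two properties of the generating function used above, namely that the probabilities sum to 1 and that the derivative at 1 gives the expected index, can be checked numerically from the recursion alone. The sketch below assumes the indexing $P_k = l_{k-1}/2^{k+2}$ stated in this section and truncates the series after 400 terms, a cut-off chosen here for illustration; the function name is hypothetical.

```python
from fractions import Fraction


def first_occurrence_probabilities(n_terms):
    """P_1..P_n with P_k = l_{k-1} / 2^(k+2), using l_{n+3} = 2 l_{n+2} - l_{n+1} + l_n."""
    l = [1, 2, 3]
    while len(l) < n_terms:
        l.append(2 * l[-1] - l[-2] + l[-3])
    return [Fraction(l[k - 1], 2 ** (k + 2)) for k in range(1, n_terms + 1)]


probs = first_occurrence_probabilities(400)
total = sum(probs)                                       # partial sum of F(1)
mean = sum(k * p for k, p in enumerate(probs, start=1))  # partial sum of F'(1)
print(float(total), float(mean))
```

Exact fractions are used so that the only approximation is the truncation of the series, whose terms decay geometrically.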

Conclusion
In this work we introduced a new approach, using graph theory, to randomness tests based on overlapping blocks. We gave all details, including bin bounds for the $\chi^2$ goodness of fit test, for the pattern 010, and explained how to generalize them to the other patterns. Finally, we computed the expected value, again for the pattern 010, and explained how to generalize it to other patterns. Since the theorems proven in this paper can easily be generalized to longer patterns, as future work we plan to extend this study and define randomness tests for longer patterns.