Model checking games and a genome sequence search

This paper considers the concept of model checking games for solving algorithmic puzzles. We describe current approaches in this field and then move to a game between a user and a software model checker, whose goal is to produce a solution to a problem encoded as a transition system and an LTL formula stating a requirement. We show how to encode and solve some problems using this approach. We then turn to the problem of searching for a pattern in a genome sequence. We implement the Z-function search method in Promela, construct the model, supply real viral data as input, and play a model checking game with the SPIN verifier. We create a fuzzy substring search method using the non-deterministic choice operator. Based on our experiments, we argue that the problem of finding a pattern with some deviations is solvable only with the swarm verification and hashcompact methods.


Introduction
To foster interest in logical methods, in particular formal verification, it is advisable to teach such methods using games and puzzles. However, the same methods can be used to solve real problems. In this paper, we show the use of non-deterministic programming for the task of finding a pattern in a genome sequence.
The contributions of the paper are as follows: (a) we focus on the concept of model checking games; (b) we show how to encode real algorithms in Promela; (c) we discuss an effective implementation of string comparison in Promela; (d) we move to fuzzy string comparison; (e) we apply it to genome sequence search; (f) we discuss ways of solving computationally hard tasks in model checking using the swarm model checking and state hashing approaches.

Related work
The concept of model checking games originated in logic and theoretical model checking. The evaluation of logical formulae can be described by such games, played by two players on an arena formed as the product of a structure K and a formula φ. One player (Verifier) attempts to prove that φ is satisfied in K, while the other (Falsifier) tries to refute this [1]. Earlier, this formalism was used in [2] to play property checking games in a process calculus and modal mu-logic with pre-defined rules for the players' moves. The formalism is used to study a particular logic and to construct winning strategies. The recently presented differential hybrid games [3] are contests of two players, called Angel and Demon, over a hybrid program α and a property φ, where [α]φ and ⟨α⟩φ refer to complementary winning conditions ([α]φ for Demon, ⟨α⟩φ for Angel); the results of this theory can be used to construct cooperative hybrid systems. In this paper, we take a different route: the model checking game has two players (a user and a software model checker); the user declares that the system does not satisfy some formula φ, and the model checker tries to refute this claim and provide a counter-example.
Moving to genome mathematics, an introduction to the field is given in [4]. Some methods for reducing the computational complexity of molecular biology tasks are presented in [5]. The community-approved approach to rapid sequence comparison, the basic local alignment search tool (BLAST), is presented in [6].
In this paper, we apply the Z-function approach to pattern search and encode it in the Promela language, which has not been done before. SPIN [7] is a utility for model checking the correctness of distributed software. The abbreviation SPIN stands for "Simple Promela Interpreter". The SPIN system verifies not the programs themselves, but their models [8]. To build a model of an original parallel program or algorithm, the verification engineer (usually manually) builds a representation of this program in a C-like input language called Promela (Protocol MEta-LAnguage).

Model checking with SPIN
The main purpose of the Promela language is to model discrete systems and protocols [9]; nevertheless, thanks to its design, the language is capable of describing sophisticated real models of different software types. One complex model is discussed in [10], where an ODE was modeled and checked using explicit floating-point arithmetic.
In this paper, we rely on the following language features:
• arrays;
• do loops;
• the if clause, including non-deterministic choice.
We also use the following features of the SPIN model checker:
• checking of LTL properties expressed as predicates over key program variables;
• the ability to generate a counter-example as a trail of visited states if the LTL property does not hold;
• optimized depth-first search (DFS);
• bitstate hashing to dramatically reduce memory use;
• the ability to parallelize the model checking process using the swarm technique.

Bitstate hashing and hashcompact
To reduce the memory needed for storing states, SPIN offers, in addition to strict (exhaustive) verification, hashing-based methods that perform an approximate check: they can visit most of the states, missing some only when hash collisions occur.
In this case, for every state of S bits, a hash value of m bits is computed, which is associated with a unique bit position within a large bit array of size 2^m [11]. For every newly generated hash value, the tool inspects the current value of the bit that corresponds to the hash value and, if it is zero, sets it to one. If the bit is already set, this counts as a hash collision. In addition, the SPIN tool by default uses two hash functions and stores two bits in the bit array for every state. A hash collision then requires a collision on both bits.
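The bit-array scheme just described can be illustrated with a minimal Python sketch. It is not SPIN's implementation: the class name BitstateTable, the use of salted BLAKE2b digests as the two hash functions, and the byte-string state encoding are all assumptions made for illustration.

```python
import hashlib


class BitstateTable:
    """Bit array of 2**m bits; every state sets k bits (k = 2 here)."""

    def __init__(self, m: int, k: int = 2) -> None:
        self.m = m
        self.k = k
        # One byte holds eight bit positions (at least one byte is allocated).
        self.bits = bytearray((1 << m) // 8 or 1)

    def _positions(self, state: bytes):
        # Assumption: k independent hash functions are imitated by salting
        # a single digest function with the values 0 .. k-1.
        for salt in range(self.k):
            digest = hashlib.blake2b(
                state, digest_size=8, salt=bytes([salt])
            ).digest()
            yield int.from_bytes(digest, "big") % (1 << self.m)

    def visit(self, state: bytes) -> bool:
        """Mark a state as visited; return True if it looked new.

        A state is reported as already seen only when all k of its bits
        were set before, i.e. a collision must happen on every bit.
        """
        is_new = False
        for pos in self._positions(state):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                is_new = True
                self.bits[byte] |= 1 << bit
        return is_new
```

The memory cost is fixed at 2^m bits regardless of the state size S, which is the point of the method; the price is that a genuinely new state may occasionally be reported as already visited.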
An alternative strategy recommended in [12] is called hashcompact. In the hashcompact method, the state descriptor is compressed from S bits to 64 bits, using a single hash function.
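For comparison, hashcompact can be sketched as storing one 64-bit value per state instead of the full descriptor. Again this is only an illustration under assumptions: the names compact and HashCompactStore are hypothetical, and BLAKE2b stands in for whatever hash function a real tool would use.

```python
import hashlib


def compact(state: bytes) -> int:
    """Compress a state descriptor to a 64-bit value with a single hash."""
    return int.from_bytes(hashlib.blake2b(state, digest_size=8).digest(), "big")


class HashCompactStore:
    """Visited-state store that keeps only the 64-bit compressed values."""

    def __init__(self) -> None:
        self.seen = set()

    def visit(self, state: bytes) -> bool:
        """Return True if the state looked new, False if its hash was seen."""
        h = compact(state)
        if h in self.seen:
            return False
        self.seen.add(h)
        return True
```

Compared with the bit array, this uses more memory per state but makes collisions far less likely, since two distinct states must agree on all 64 bits.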

Swarm model checking
Swarm model checking is an approach that generates and runs a set of verification tests (VTs) by combining three basic ideas for modifying the search process [13]:
• search randomization (using different seed values for non-deterministic choices);
• search diversification (performing searches forwards or in reverse, varying hashing options);
• search parallelization (running multiple VTs in parallel).
Swarm is implemented using a pre-processor tool that generates a script to compile different VTs from an input Promela model and runs them.
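The randomization and parallelization ideas can be illustrated with a toy Python sketch. It does not reproduce the Swarm tool: the transition system (a number that may be incremented or doubled), the function names, and the per-worker budget of expanded states are assumptions chosen only to show several differently seeded depth-first searches running side by side; diversification is not modeled.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def successors(n, limit):
    """Toy transition relation: n -> n + 1 or n -> 2 * n, bounded by limit."""
    return [m for m in (n + 1, n * 2) if m <= limit]


def randomized_dfs(seed, start, target, limit, budget):
    """One VT: depth-first search with seeded random successor order.

    Returns the trail from start to target, or None if the budget of
    expanded states is exhausted first.
    """
    rng = random.Random(seed)
    stack = [(start, (start,))]
    visited = set()
    while stack and budget > 0:
        state, trail = stack.pop()
        if state in visited:
            continue
        visited.add(state)
        budget -= 1
        if state == target:
            return list(trail)
        nxt = successors(state, limit)
        rng.shuffle(nxt)  # search randomization
        for s in nxt:
            if s not in visited:
                stack.append((s, trail + (s,)))
    return None


def swarm(seeds, start, target, limit, budget):
    """Run one randomized search per seed in parallel; map seed -> trail."""
    seeds = list(seeds)
    with ThreadPoolExecutor() as pool:
        trails = pool.map(
            lambda s: randomized_dfs(s, start, target, limit, budget), seeds
        )
        return dict(zip(seeds, trails))
```

With a small budget, individual workers may fail while others succeed, which is why running many cheap, differently seeded searches can be more productive than one exhaustive run.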

SARS-CoV-2 genome: related information
In this subsection, we describe the information about the coronavirus genome that is relevant to our goals.
SARS-CoV-2, the coronavirus that causes the COVID-19 pandemic, has had a strong influence on the world economy, has led to thousands of deaths and has changed the plans of billions of people; on the other hand, it has accelerated digitalization and has drawn enormous interest to research in biology and computational biology.
The coronavirus genome has already been sequenced and is available in [14]. In [15], Wu et al. analyzed the genome and found that it is 89% similar to the bat coronavirus bat-SL-CoVZC45 [16]. The viral genome is a single-stranded RNA consisting of adenine (A), guanine (G), cytosine (C) and uracil (U); however, uracil is often written as thymine (T) in the sequence for compatibility with software, so we have a string of 29903 nucleotides over the alphabet {A, G, C, T}. Figure 1 shows a graphical representation of the viral genome built with a patched version of mfold [17].
One of the main tasks in this field is the comparison of genomes, which can help to investigate which animals the virus has come from and which parts of the genome have changed due to mutations. This brings us to the problem of string comparison, and the comparison should be fuzzy (it should allow some deviations).

Substring search methods
In this subsection, we recall some methods of finding a pattern in a given string.
In the naive algorithm, all admissible shifts are tried in a loop in which the current characters of the string and the pattern are checked for equality. Such an algorithm has O(N²) complexity, where N is the length of the string (we write |string| for the length).
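As a point of reference, the naive search can be written as the following Python sketch; the function name naive_find and the choice to return all matching shifts are ours.

```python
def naive_find(text, pattern):
    """Return every shift at which pattern occurs in text (naive scan)."""
    n, m = len(text), len(pattern)
    hits = []
    for shift in range(n - m + 1):
        k = 0
        # Compare character by character until a mismatch or a full match.
        while k < m and text[shift + k] == pattern[k]:
            k += 1
        if k == m:
            hits.append(shift)
    return hits
```

For example, naive_find("GATTACAGATTA", "ATTA") returns [1, 8].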
There are plenty of algorithms that perform the search more efficiently (including hashing, tries, the suffix automaton and the Knuth-Morris-Pratt automaton). In this paper, we use the Z-function algorithm to search for the pattern. It is fast and requires no pointers or complex data structures, so it can be implemented in a modeling language such as Promela.
The Z-function of a string S is an array Z in which each element Z[i] is the length of the longest common prefix of S and the suffix of S starting at i:
Z[i] = max { k ≥ 0 : S[0..k−1] = S[i..i+k−1] }    (1)
Pseudocode for building the Z-function in a loop through a given string and a pattern according to (1), and for finding a substring pattern in a string text using the Z-function, is given in [18]. The implementation is quite similar to that algorithmic pseudocode, but it opens some doors to using formal verification and model checking games for fuzzy comparison of genomes.
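A common linear-time construction of the Z-function, and a search built on it, can be sketched in Python as follows. This is the textbook formulation rather than the Promela model; the convention Z[0] = 0, the separator character "#", and the function names are assumptions.

```python
def z_function(s):
    """Z[i] = length of the longest common prefix of s and s[i:]; Z[0] = 0."""
    n = len(s)
    z = [0] * n
    left = right = 0  # [left, right) is the rightmost matched segment so far
    for i in range(1, n):
        if i < right:
            # Reuse the value computed for the mirrored position.
            z[i] = min(right - i, z[i - left])
        # Extend the match by direct comparison.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def z_find(text, pattern, sep="#"):
    """Return all shifts of pattern in text via Z over pattern + sep + text.

    Assumption: sep occurs neither in text nor in pattern.
    """
    m = len(pattern)
    z = z_function(pattern + sep + text)
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]
```

For the string "abacaba" the sketch yields [0, 0, 1, 0, 3, 0, 1], and z_find("GATTACAGATTA", "ATTA") returns [1, 8].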

Model checking games
In this section, we describe the concept of model checking games. In Karpov's book [19], puzzle solving with a model checker was introduced using the example of the wolf, goat and cabbage problem. The idea of the puzzle is as follows:

It is necessary to transfer all three to the other side of a river in a series of trips in a boat that can only carry two objects and a ferryman. Nothing happens while the ferryman is present, but there are restrictions when the passengers are left alone on either side of the river: the wolf can eat the goat and the goat can eat the cabbage.
A domain-specific approach to constructing and solving the task is given in [20]. The task can be solved with a recursive DFS algorithm that tries sequences of transfers of the different objects under these restrictions. With model checking, the proposal is first to encode the state of the system and the rules for changing the state, and then to create an LTL rule stating that "always, the final state will not be reached". If a solution really exists, the model checker can find a path to the final state and present it as a counter-example. The state trail to the final state is then the solution of the problem.
We describe the approach using another simple numeric puzzle that requires little coding. Let there be a number n, then:
• if it is even, divide it by 2, i.e. n ⇒ n / 2;
• if it is odd, multiply it by 3 and add 1, i.e. n ⇒ 3n + 1;
• repeat these actions until n reaches 1.
If we start with the number 7, is it possible to reach 1? To solve the problem, we encode the task rules in Promela. We also add the LTL clause check_me, in which we claim that n will never be 1. Since a solution exists, the model checker finds during verification a path from 7 to 1 that follows the rules, and the steps leading to 1 are printed by our printf operators in simulation mode using the generated counter-example trail.
APITECH II, Journal of Physics: Conference Series 1679 (2020) 032020, IOP Publishing, doi:10.1088/1742-6596/1679/3/032020
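The idea of obtaining the solution as a counter-example can be imitated outside a model checker. The Python sketch below is not the Promela model: it simply follows the rules of the numeric puzzle and returns the trail that violates the claim "n is never 1"; the function names and the step bound max_steps are assumptions.

```python
def step(n):
    """Apply one rule of the puzzle: halve an even n, map an odd n to 3n + 1."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


def find_counterexample(start, max_steps=1000):
    """Check the claim 'always n != 1' along the run that begins at start.

    Returns the trail of values ending in 1 if the claim is violated
    within max_steps steps, otherwise None.
    """
    trail = [start]
    n = start
    for _ in range(max_steps):
        if n == 1:
            return trail
        n = step(n)
        trail.append(n)
    return trail if n == 1 else None
```

Starting from 7, the trail is 7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1, so the claim is refuted after 16 steps.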
We can then move to a more sophisticated example from our tutorial in [21]. We solve the well-known Tower of Hanoi puzzle. Let there be three rods on which round disks of different sizes can be mounted, with the rule that a smaller disk can be put on a larger disk, but not vice versa. We assume that there are five disks. Initially, they are all on the first rod. It is necessary to move all five of them to the third rod, using the second as an intermediate.
The structure of our solution in Promela could be as follows:
        count1--; count2++;
      }
      :: (count2 == 0 || (count2 < N && rod2[count2-1] > disk)) // or don't try
         -> skip;
      fi
    }
    // try 1 -> 3; try 2 -> 1
    // try 2 -> 3;
    // try 3 -> 1; try 3 -> 2
  od
}
In this fragment, we encode the problem as a loop in which we try to move a top disk to a different rod whenever the rules allow it. Note that we also encode non-deterministic choices (to move or not to move the disk), which makes it possible to check all the variants. The LTL rule "always, we will not collect all five disks on the last rod" is the negation of the condition for solving the problem. In our experiment, the model checker finds a solution as a counter-example, but it is not optimal; therefore, the number of moves should also be added to the LTL property, and a series of runs should be made.
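The difference between finding some solution and finding an optimal one can be seen in a small explicit-state search. The Python sketch below is an illustration, not the Promela model: states are tuples of rods, the stack-based variant stands in for the depth-first order of a model checker, and the queue-based variant finds a shortest solution; all names are assumptions.

```python
from collections import deque

N_DISKS = 5


def moves(state):
    """Yield (src, dst, next_state) for every legal move.

    A state is a tuple of three rods; each rod lists disks bottom to top.
    """
    for src in range(3):
        if not state[src]:
            continue
        disk = state[src][-1]
        for dst in range(3):
            # A disk may go onto an empty rod or onto a larger disk.
            if dst != src and (not state[dst] or state[dst][-1] > disk):
                rods = list(state)
                rods[src] = state[src][:-1]
                rods[dst] = state[dst] + (disk,)
                yield src, dst, tuple(rods)


def solve(depth_first):
    """Search for the goal state; return the list of (from_rod, to_rod) moves."""
    start = (tuple(range(N_DISKS, 0, -1)), (), ())
    goal = ((), (), start[0])
    frontier = deque([start])
    parent = {start: None}
    while frontier:
        state = frontier.pop() if depth_first else frontier.popleft()
        if state == goal:
            path = []
            while parent[state] is not None:
                state, move = parent[state]
                path.append(move)
            return path[::-1]
        for src, dst, nxt in moves(state):
            if nxt not in parent:
                parent[nxt] = (state, (src + 1, dst + 1))
                frontier.append(nxt)
    return None
```

Here solve(False) explores breadth-first and returns the 31 moves of a shortest solution for five disks, while solve(True) returns some valid but typically longer sequence, which mirrors the non-optimal counter-example mentioned above.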
We can now describe the process of the model checking game (figure 2). Based on the task description, the user creates a transition system that describes the steps of the task, as well as an LTL formula that encodes the impossibility of reaching the final state (the user acts as a sceptic). Then the user starts to play a game with the model checker. The verifier program can say "you are not right, the problem is solvable, here is the counter-example" if a solution really exists and the checker is capable of solving the task within the given time and memory constraints using its optimized DFS algorithms. The user can then improve the rules or change the program by adding more output or properties, in order to solve the problem more optimally based on the state trail obtained from the model checker. The game continues until the user is satisfied.
In figure 2, φ is an LTL formula, T is a transition system, and C is a counter-example that serves as a solution of the task and is a part of the transition system; the absence of C means that the verifier was unable to find a counter-example.

Figure 2. A scheme of the model checking games.

Fuzzy string search
To perform fuzzy substring search, we added the following to the last listing in section 3:
• a non-deterministic choice when we compare symbols while building the Z-function;
• a condition that limits the allowed percentage of changes.
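One way to read these two additions is that, at each compared position, the model may treat a mismatching symbol as a match, as long as the total number of such changes stays within the allowed percentage. Under that reading, a shift is reachable by some sequence of choices exactly when its number of mismatches fits the budget. The Python sketch below computes that outcome deterministically; it is our interpretation rather than the Promela code, it handles substitutions only (no insertions or deletions), and the name fuzzy_find and the parameter max_percent are assumptions.

```python
def fuzzy_find(text, pattern, max_percent):
    """Return (shift, changes) for shifts matching with a bounded change rate.

    max_percent is the allowed percentage of substituted symbols relative
    to the pattern length (integer arithmetic, rounded down).
    """
    m = len(pattern)
    budget = m * max_percent // 100
    hits = []
    for shift in range(len(text) - m + 1):
        changes = 0
        for k in range(m):
            if text[shift + k] != pattern[k]:
                # "Accept" the mismatch as a change while the budget allows.
                changes += 1
                if changes > budget:
                    break
        else:
            hits.append((shift, changes))
    return hits
```

For example, fuzzy_find("GATTACAGCTTA", "ATTA", 25) returns [(1, 0), (8, 1)]: an exact occurrence at shift 1 and an occurrence with one substituted symbol at shift 8.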

Input data generation
As input strings are represented as arrays, we add a definition for the input alphabet. Promela does not support I/O operations, so we implemented a processor for the .fasta file (containing the input genome sequence) and a Promela code generator; figure 3 shows a fragment of the generated input data sequence. With this approach, we get a large number of lines of code, with 29903 states and transitions just to fill in the input data of the viral genome. An alternative would be to load the data using embedded C code constructions according to [22], but we did not test it since we rely on pure model checking techniques.
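A generator of this kind can be sketched in a few lines of Python. The sketch is hypothetical: the exact shape of the generated code in figure 3 is not reproduced here, the array name text is an assumption, and the emitted assignments presuppose that an array of that name and the symbols A, G, C, T are declared elsewhere in the model.

```python
def read_fasta(lines):
    """Concatenate the sequence lines of a FASTA file.

    Header lines (starting with '>') and blank lines are skipped; symbols
    are upper-cased and uracil (U) is written as thymine (T).
    """
    parts = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            parts.append(line.upper().replace("U", "T"))
    return "".join(parts)


def to_promela(seq, name="text"):
    """Emit one Promela assignment per nucleotide, e.g. 'text[0] = A;'."""
    return [f"{name}[{i}] = {symbol};" for i, symbol in enumerate(seq)]
```

A usage example under the same assumptions: to_promela(read_fasta([">header", "AC"])) yields the two lines text[0] = A; and text[1] = C;. For the full genome this produces 29903 assignment lines, which is consistent with the cost of input loading noted above.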

Never claim for the model checking game
According to the rules of model checking games, we should specify the negation of the rule that expresses the fact of solving the puzzle. For fuzzy genome sequence search, we specify a simple rule stating that "always, it is impossible to find a substring", where impossible is the variable that is changed by the linear substring search over the Z-function we built previously.
Looking further, a real model checking game that compares two full genomes with a given deviation rate (in this case, SARS-CoV-2 and bat-SL-CoVZC45, to prove the 89% similarity) requires a large number of VTs with different randomized transition seeds. The task should be divided into loading the data into common memory (a phase that is the same for all VTs) and then performing different Z-function calculations on the same data. This would require a custom model checker. We also see that the CPU swarm technique is not a good way to execute such a set of VTs, and that a GPU swarm [23] or an FPGA swarm [24] should be used instead.