Certified Grammar Transformation to Chomsky Normal Form in F*

Certified programming allows one to prove that a program meets its specification. Correctness is checked at compile time, which guarantees that the program always runs as specified; hence, there is no need to test certified programs to ensure they work correctly. There are numerous toolchains designed for certified programming, but F* is the only language that supports both general-purpose programming and semi-automated proving. The latter means that F* infers proofs where possible, and the user can provide more complex proofs manually when necessary. We apply this technique to YaccConstructor, a grammarware research and development project. We present a work-in-progress verified implementation of the transformation of a context-free grammar to Chomsky normal form, a step toward the certification of the entire project. Among other features, the F* system allows one to extract code in the F# or OCaml languages from a program written in F*. The YaccConstructor project is mostly written in F#, so this feature of F* is of particular importance: it allows us to maintain compatibility between certified modules and the existing, not yet certified, parts of the project. We also discuss advantages and disadvantages of this approach and formulate topics for further research.


Introduction
Certified programming is designed for proving that a program meets its specification. For this technique, proof assistants or interactive theorem provers are used [1], which allows one to check the correctness of a program at compile time and guarantees that the program always works according to its specification. Classical fields of application of certified programming are the formalization of mathematics, the security of cryptographic protocols, and the certification of properties of programming languages. There are two approaches to certified programming [2]. In the classical approach the program, its specification, and the proof that the program meets its specification are written separately, as different modules. Such a technique costs too much to be applied in software development. A more effective approach is to combine the program, its specification, and the proof in one module by means of dependent types [3], [4]. The most well-known toolchains for program verification are Coq [5], Agda [6], F* [7] and Idris [8]. Among them, F* is the only language which supports both semi-automated proving and general-purpose programming [9]. As a proof assistant, F* allows one to formulate and prove properties of programs by using lemmas and enriching types. F* infers not only the types of functions, but also the properties of their computations, such as purity, statefulness, and divergence. For example, consider a function f, declared with the keyword val, which introduces f and its type signature; f takes a function g and an integer value as arguments. The effect of computation Tot t is used for a total expression, which always evaluates to a t-typed result without entering an infinite loop, throwing an exception, or causing other side effects. Hence, for some programs one can prove not only their properties and restrictions on their types, but also guarantee their termination and that the result has the assigned type. We apply certified programming using F* to a grammarware research and development project
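A minimal sketch of such a declaration (the concrete types here are chosen purely for illustration) could be:

```fstar
(* f applies its functional argument g to the successor of x;
   the Tot effect states that f always terminates without side effects *)
val f: g:(int -> Tot int) -> x:int -> Tot int
let f g x = g (x + 1)
```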
YaccConstructor (YC) [10], [11]. YC is a tool for parser construction and grammar processing. It is also a framework for research and development of lexer and parser generators and other grammarware for the .NET platform. The verification of its programs covers the topic of parser correctness: how to obtain formal evidence that a parser is correct with respect to its specification [12]. In this article, we consider only one algorithm implemented in YC, namely the transformation of a context-free grammar to Chomsky normal form, which is a small step towards the certification of the entire project. The algorithm of grammar normalization consists of four transformations. We prove the totality of each of them and establish an order of their application to the input grammar. In addition, we describe the peculiarities of using F* as a proof assistant and formulate topics for further research.

Overview of F*
We use the functional programming language F* [7] for program verification. It is the only language that supports both semi-automated proving and general-purpose programming [9]. The main goal of this tool is to span the capabilities of interactive proof assistants like Coq [5] and Agda [6], general-purpose programming languages like OCaml and Haskell, and SMT-backed semi-automated program verification tools like Dafny [13] and WhyML [14].
The type system of F* includes polymorphism, dependent types, monadic effects, refinement types, and a weakest precondition calculus [15], [16]. These features allow expressing precise and compact specifications for programs [7]. A dependent function type has the following form: x1:t1 -> ... -> xn:tn[x1..x(n-1)] -> t[x1..xn]. Each of the function's formal parameters is named xi, and each of these names is in scope to the right of the first arrow that follows it. The notation t[x1..xn] indicates that the variables x1..xn may appear free in t. A refinement type has the form x:t{h(x)}. It is a subtype of t restricted to those expressions of type t that satisfy the predicate h(x).
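For instance, natural numbers are expressed in F* as a refinement of integers, and a hypothetical division function can demand a nonzero divisor directly in its signature:

```fstar
(* nat is the subtype of int restricted by the predicate x >= 0 *)
type nat = x:int{x >= 0}

(* the refinement on y rules out division by zero at type-checking time *)
val safe_div: x:int -> y:int{y <> 0} -> Tot int
let safe_div x y = x / y
```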
In addition to inferring a type, F* also infers side effects of an expression, such as exceptions and state. The following are the most significant monadic effects.
- Tot t: the effect of a computation that guarantees evaluation to a t-typed result, without entering an infinite loop, throwing an exception, or reading or writing the program's state.
- ML t: the effect of a computation that may have arbitrary effects, but if some result is computed, then it is always of type t.
- Dv t: the effect of a computation that may diverge.
- ST t: the effect of a computation that may diverge, read, write, or allocate on a heap.
- Exn t: the effect of a computation that may diverge or raise an exception.

The effects {Tot, Dv, ST, Exn, ML} are arranged in a lattice, with Tot at the bottom, ML at the top, and with ST unrelated to Exn. There are two main approaches to proving properties: either by enriching the type of a function (intrinsic style) or by writing a separate lemma about it (extrinsic style). An example of the first approach is given below; the keyword val indicates the declaration of a value and its type signature.
val append: l1:list 'a -> l2:list 'a -> Tot (l:list 'a{length l = length l1 + length l2})
let rec append l1 l2 = match l1 with
  | [] -> l2
  | hd :: tl -> hd :: append tl l2

The following example demonstrates the extrinsic style, in which the formula after the keyword requires is the pre-condition of the lemma, while the one after the keyword ensures is its post-condition.

val append_len: l1:list 'a -> l2:list 'a ->
    Lemma (requires True)
          (ensures (length (append l1 l2) = length l1 + length l2))
let rec append_len l1 l2 = match l1 with
  | [] -> ()
  | hd :: tl -> append_len tl l2
There is no general rule for which style of proving to use, but in some cases it is impossible to prove a property of a function directly in its type, and one has to use a lemma. When defining lemmas or expressions that are total, F* automatically proves their termination. The termination check is based on a well-founded relation. For natural numbers, F* uses the classical decreasing metric; for inductive types, the sub-term ordering; for recursive functions, it requires the tuple of parameters to be decreasing in a lexicographic ordering. The default ordering can be overridden with the clause decreases %[e1;..;en], which explicitly chooses a lexicographic ordering on the arguments. To conclude, one can use F* to write effectful programs, specify them using dependent and refinement types, and verify them using an SMT solver or by providing interactive proofs. Programs written in F* can be translated to OCaml or F# for further execution.
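The decreases clause can be illustrated with the classic Ackermann function: neither argument decreases on every recursive call, but the pair of arguments decreases lexicographically, which is exactly what %[m; n] expresses.

```fstar
val ackermann: m:nat -> n:nat -> Tot nat (decreases %[m; n])
let rec ackermann m n =
  if m = 0 then n + 1
  else if n = 0 then ackermann (m - 1) 1
  else ackermann (m - 1) (ackermann m (n - 1))
```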

Verification of transformation of CFG to CNF
In this section we briefly describe some necessary aspects of the theory of formal languages, sketch a totality proof for one of the grammar transformations to Chomsky normal form in F*, and formulate some advantages and disadvantages of this approach.

Context-free grammar and Chomsky normal form
In this section we give basic definitions and formulate a theorem that helps us to verify the implemented algorithm of transformation of a context-free grammar to Chomsky normal form. In formal language theory, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form A → α, where A is a single nonterminal symbol and α is a string of terminals and/or nonterminals (α can be empty). A context-free grammar is said to be in Chomsky normal form (CNF) if all of its production rules are of one of the forms

A → BC,
A → a,
S → ε,

where A, B and C are nonterminal symbols, a is a terminal symbol, S is the start nonterminal, and ε denotes the empty string. Also, neither B nor C may be the start symbol, and the third form of production rule can only appear if ε is in L(G), namely, the language produced by the context-free grammar G. Context-free grammars given in Chomsky normal form are very convenient to use. It is often assumed that either CFGs are given in CNF from the beginning or there is an intermediate normalization step. Having a certified implementation of normalization for CFGs enables us to stop thinking in terms of an arbitrary CFG and consider a grammar in CNF without losing guarantees of correctness. CFG normalization theorem: there is an algorithm which converts any CFG into an equivalent one in Chomsky normal form. The full normalization transformation for a CFG is a composition of the following constituent transformations.
- Eliminate all long rules, i.e., rules with more than two symbols in the right-hand side.
- Eliminate all ε-rules.
- Eliminate all chain rules.
- For each terminal a, add a new rule N → a, where N is a "fresh" nonterminal, and replace a with N in the right-hand sides of all rules of length at least two. For instance, the rule A → aB becomes A → N B together with the new rule N → a.

Verification with F*
Our purpose is to verify the core of YaccConstructor (YC) using F*. YC is an open source modular tool for research in lexical and syntax analysis, and its main development language is F# [17]. In this paper we consider only the verification of the grammar normalization algorithm [18]. The function toCNF is a composition of the four transformations mentioned above. Notice that the order in which the rules are executed is important. The first rule must be executed before the second, otherwise the normalization time may increase to O(2^n). The third rule follows the second, because the elimination of ε-rules may produce new chain rules. Also, the fourth rule must be executed after the second and the third, as they can generate useless symbols. F*, as a proof assistant, allows one to formulate and prove properties of a function of interest using lemmas or by enriching types. For example, in F# the function (f (x:int) = 2*x) is inferred to have type (int -> int), while in F* we infer (int -> Tot int). This indicates that (f (x:int) = 2*x) is a pure total function which always evaluates to an int. A lemma is a ghost total function that always returns the single unit value (). When we specify a total function, we have to prove the totality of every nested function, because F* supports only top-level annotations. In other words, we cannot add an annotation for a nested function. Therefore, to prove the totality of a function containing nested functions, we need to lift all nested functions up and explicitly prove their totality. We describe each function of interest in an individual module to avoid namespace collisions. We use a module architecture similar to the YC architecture. Module IL contains type constructors for describing the productions of a grammar. Module Namer contains a function to generate new names. Finally, we created individual modules for each transformation and a separate, main, module which contains the definition of the toCNF transformation.
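The definition of toCNF can then be sketched as the following composition; splitLongRules and deleteChainRules are the names used in our code, while deleteEpsRules and deleteUselessRules are assumed names introduced here for illustration:

```fstar
(* the order matters: long rules are split first, then ε-rules and
   chain rules are removed, and finally useless symbols are deleted *)
let toCNF ruleList =
  deleteUselessRules (deleteChainRules (deleteEpsRules (splitLongRules ruleList)))
```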
We implemented all the transformations in F* [19], but in this paper we consider only one of them, namely SplitLongRules, which eliminates long rules.Firstly, we describe all the helpers we need, prove their totality and other necessary properties, and then explain why this transformation is correct.
In the first transformation, it is necessary to create new nonterminals, so we need a function to supply them. The function Namer.newSource defined below is used.
val newSource: n:int -> oldSource:Source -> Tot Source
let newSource n old = ({old with text = old.text ^ (string_of_int n)})

The integer n is equal to the size of the list of rules at the moment the function Namer.newSource is called. Obviously, the function Namer.newSource is injective. In other words, unique rule names remain unique after applying splitLongRules.
Some necessary helpers are grouped in the TransformAux module: for example, the functions createRule and createDefaultElem, which take some arguments and return a Rule and an Elem respectively. Elem is the right part of a rule if the latter is a sequence. Also, we define one simple function which returns the length of the right part of a rule. There are some peculiarities in our implementation which are worth mentioning. One of them is the representation of the right-hand sides of the rules by lists. In the algorithm, we need to cut off the two last elements of a rule, so we carry out the following steps.
let revEls = List.rev elements
let cutOffEls = [List.Tot.hd revEls; List.Tot.hd (List.Tot.tl revEls)]

The functions List.hd and List.tl from the standard library are not defined for an empty list, so they cannot be considered total, which limits their usage in our code. In F* there is a module List.Tot which provides proper total analogs of the functions mentioned. We only provide their signatures here.
val hd: l:list 'a{is_Cons l} -> Tot 'a
val tl: l:list 'a{is_Cons l} -> Tot (list 'a)

The predicate is_Cons takes a list as an input and returns false if it is empty; otherwise it returns true. If the function List.Tot.hd is applied to a list whose nonemptiness is not clear from the context, F* reports a type mismatch. A pleasant peculiarity of F* is that in some cases it can derive the necessary properties automatically. In our implementation of the transformation, only the rules which have more than two symbols in the right-hand side are split. In this case F* is able to derive the required type automatically, so we can choose two elements. We proved the totality of all the nested functions. Now we want to prove the termination of the general one, cutRule. In our case, it is sufficient that the length of the rule strictly decreases on each recursive call; we are not interested in the length of the accumulator. To prove this we must explicitly specify that after applying List.Tot.tl to a list, its length is reduced by 1, using the same method as before. With this information F* is able to conclude that cutRule is total. The function splitLongRules takes a list of rules, applies cutRule to each rule, then concatenates all the results and returns the combined list. Its totality is proved automatically by F*.
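A simplified sketch of the resulting signature of cutRule, where lengthBodyRule is an assumed name for the helper that returns the length of a rule's right-hand side and the second argument is the accumulator:

```fstar
(* the decreases clause tells F* that termination is driven solely by
   the length of the rule's right-hand side, not by the accumulator *)
val cutRule: rule:Rule -> resRuleList:list Rule ->
    Tot (list Rule) (decreases %[lengthBodyRule rule])
```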
Previously we proved the totality of our transformation, but we have not yet mentioned the properties of the rules we get after applying splitLongRules. Instead of proving a lemma about these properties, we add a restriction on the type of the function, which guarantees the necessary property of the result. Now we have almost everything we need to prove such properties. We have to provide some additional information so that F* can check the argument types when collect is called recursively. At the moment of cutting the rule off, we should fix the length in the type of the cut part. For this purpose we have to define a function that takes our list and returns a part with that type. Further, we have to prove a lemma stating that the concatenation of two lists of short rules is a list of short rules. After that, F* accepts the type as correct.
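The enriched signature of splitLongRules can be sketched as follows; the exact formulation in our code may differ, and shortRule is a hypothetical predicate stating that a rule's right-hand side contains at most two symbols:

```fstar
(* every rule in the result is guaranteed, by type, to be short *)
val splitLongRules: ruleList:list Rule ->
    Tot (res:list Rule{forall (r:Rule). List.Tot.mem r res ==> shortRule r})
```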

Advantages and disadvantages of F*
In this section we want to outline some advantages and disadvantages of programming in F*. In F#, even if there is no doubt that some functions are correct, an incorrect result may still be obtained by applying them in the wrong order. F* can prevent such situations if the programmer specifies the properties demanded of the input data in a function's type. For instance, deleteChainRules should only be applied after deleting epsilon rules. This can be ensured by specifying a signature for the deleteChainRules function in which a predicate has_no_eps_rules checks that there are no epsilon rules in the input. Unfortunately, there are some disadvantages of F* which we want to emphasize. First of all, it does not provide any, even primitive, support for object-oriented features.
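Such a signature could look as follows (a sketch; the precise definition of has_no_eps_rules is omitted):

```fstar
(* the refinement on the argument makes it a type error to call
   deleteChainRules before epsilon rules have been eliminated *)
val deleteChainRules: ruleList:list Rule{has_no_eps_rules ruleList} -> Tot (list Rule)
```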
One can use structures instead of classes, but this complicates development. For example, we had to explicitly create functions for constructing elements of types. In other words, rather than creating a class Person with constructors and methods and writing

let person = new Person("Nick", 27)

one has to write code in a rather cumbersome manner:

let new_Person name age = {name=name; age=age}
There is a special construct in many functional languages which checks whether some property holds for a value. Such a construct is called a guard in Haskell, and when in OCaml and F#, and it is often used in pattern matching to simplify code. Unfortunately, it is not supported in F*, and one can only hope that it will appear in later versions of the language. Lastly, we note the poor quality of error reporting in F*, which sometimes makes it hard to understand why proofs fail.
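For example, where in F# one can write a guard such as | x :: _ when x > 0 -> ..., in F* the condition has to be moved into the branch body; a sketch:

```fstar
(* without guards, the test on the head of the list is expressed
   with an ordinary conditional inside the branch *)
let describe (l: list int) : Tot string =
  match l with
  | x :: _ -> if x > 0 then "positive head" else "non-positive head"
  | [] -> "empty"
```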

Conclusion and future work
We presented a verification of one of the transformations of a context-free grammar to Chomsky normal form. We proved the totality of each function implemented, as this property guarantees that computations always terminate and do not have side effects, which is useful in practice. Although a complete proof of the correctness of the grammar transformation still requires proving the equivalence of the original and the resulting grammars, we have already obtained interesting results. We can specify the input and output of functions, using refinement and dependent types, which allows us to establish the application order of the four transformations, by means of which the correctness of the whole transformation is guaranteed. We use the programming language F* to verify the implementation, but to be able to execute it one needs to extract it to OCaml or F# and then compile it with the OCaml or F# compiler respectively. At the moment, the mechanism for extracting code from F* to F# omits casts and erases dependent types, higher-rank polymorphism, and ghost computations [9]. These features are very important, and their absence breaks the consistency and correctness of programs within the target language. F* is currently under active development, and the implementation of an extraction mechanism which copes with the above shortcomings is a topic of our further research.