GENERATING TYPE-SAFE SCRIPT LANGUAGES FROM FUNCTIONAL APIS

It is often useful to expose an Application Programming Interface (API) as a scripting language when developing complex applications, especially when many teams work on the same product. This allows for a solid separation of concerns and enables rapid development using the scripting language. However, exposing an API can involve a huge amount of effort, and a common problem with these scripting languages is their type-unsafe nature, which can easily result in issues that are hard to debug. Our solution is to generate an interpreter from a functional API using C++ metaprogramming techniques. The generated script language is type-safe. Using our libraries, it takes minimal effort to generate an interpreter from such an API. We also present a case study with a widely used EDSL. This method also turned out to be a more general solution to several other problems, including providing type-safe auto-completion support for the interpreted language and implementing a plug-in system that enables fast prototyping while remaining type-safe.


INTRODUCTION
Compiled languages provide us with several advantages, including superior performance and static guarantees. However, compilation can take a significant amount of time, which degrades developer productivity. Moreover, once compilation has finished, one needs to stop the old instance of the software and start a new one to use the new version. Sometimes initializing the software is even more time consuming than compiling it. One possible solution would be to use dynamic libraries and reload them on demand. However, this solution is platform-dependent, and compilation still takes time.
Interpreted languages are usually not statically type-checked and are therefore more error-prone. On the other hand, a fast, iterative, trial-and-error style of development works fairly well in an interpreted environment. Due to the lack of in-depth static analysis of the code, the vast majority of errors are detected only at runtime. These languages are not well suited to systems programming, in part because of their poor performance compared to compiled languages. However, since most platform-dependent details are usually abstracted away by the interpreter, such scripting languages tend to be portable.
It is not unusual to implement the performance-critical part of a piece of software in a compiled language and expose an Application Programming Interface (API) to an interpreted language. Unfortunately, this involves a lot of work, and the scripts require extensive testing, because code on unexplored code paths is not verified by the interpreter. This paper presents a solution that automates the generation of an interpreter for a type-safe scripting language using the type information of the hosting compiled language. This method can save a significant amount of time and effort in programming as well as in debugging (thanks to type-checking).
First we present the problem and discuss the general methods we use to cope with it. Then we propose a possible implementation. After that we present a case study of our current solution, followed by a discussion of related work. Finally, we summarize the paper.

GENERAL METHOD
An API or an Embedded Domain Specific Language (EDSL) in a statically typed language restricts how it can be used, thanks to the type system. These restrictions prevent some misuses of the API. When an API is exposed to a dynamically typed scripting language, the restrictions enforced by the type system no longer protect the user from such potential errors. One possible solution would be to modify the grammar of the scripting language so that it imposes restrictions similar to those of the type system. However, most script languages today do not give us the flexibility to modify their grammar. Another solution would be to use a statically typed script language and expose all of the type information. The interpreter of this language can be generated from the source code, and the language tends to be domain-specific. Both solutions require redundant effort from the developers. Once the type information is available in the host language, why should we spend time duplicating it by describing those restrictions to the script language's interpreter? The main source of the problem is not the initial implementation but the constant overhead: every time the API changes, we have to update the bindings to the scripting language as well. This is a tedious and error-prone process.
We created a metaprogram library that uses the limited capabilities of C++'s compile-time type introspection to derive grammar rules from the type information of the functions and functors making up the API (Figure 1). If we use this library to expose the API to a special scripting language, the maintenance cost of the bindings is significantly reduced. The library needs an enumeration of the functions and functors that are part of the API. When a new function is introduced, or an old one is deleted, only this list has to be maintained. When the type signature of a function changes, however, no changes are required. Also, if a new conversion operator is introduced between two types, the library will still automatically generate the correct grammar.
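The kind of compile-time introspection described above can be sketched as follows. This is a minimal illustration, not the paper's library: the trait SignatureTraits, the helper composable and the small Shape/Circle API are all hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical trait, not part of the paper's library: splits a function
// type into its return type and its parameter types.
template <typename Signature>
struct SignatureTraits;

template <typename Ret, typename... Args>
struct SignatureTraits<Ret(Args...)> {
    using result_type = Ret;
    static constexpr std::size_t arity = sizeof...(Args);

    template <std::size_t I>
    using argument = typename std::tuple_element<I, std::tuple<Args...>>::type;
};

// Illustrative API; the names are invented for this sketch.
struct Shape {};
struct Circle : Shape {};

Circle* makeCircle(const char*) { return nullptr; }
int area(Shape*) { return 0; }

// True when the result of Inner may be passed as the I-th argument of Outer.
template <typename Outer, std::size_t I, typename Inner>
constexpr bool composable() {
    return std::is_convertible<
        typename SignatureTraits<Inner>::result_type,
        typename SignatureTraits<Outer>::template argument<I>>::value;
}

// The checks run entirely at compile time.
static_assert(composable<decltype(area), 0, decltype(makeCircle)>(),
              "Circle* converts implicitly to Shape*");
static_assert(!composable<decltype(makeCircle), 0, decltype(area)>(),
              "int does not convert implicitly to const char*");
```

If the signature of area or makeCircle changes, the answers change with it and no binding code has to be edited, which is the maintenance property the paragraph above relies on.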
Such an interpreter is also better for the user of the API, because the generated grammar makes some classes of errors impossible. The user therefore does not need to write tests for those kinds of errors, saving a considerable amount of time and effort.

IMPLEMENTATION
The introduced technique is implemented in C++ [7] and makes heavy use of template metaprogramming. Templates in C++ provide us with a Turing-complete language [9] that is evaluated at compile time. It is also possible to generate whole software components, such as interpreters, solely using metaprograms. This is called generative programming [8]. However, the proposed solution is not unique to C++; it can be implemented in any language that has similar support for metaprogramming, such as D [17]. An alternative approach to template metaprogramming is dynamic introspection (reflection), which allows a similar technique to be used in languages such as Java.
The basic idea is that the type information of an API is available to the compiler. Pattern matching with template specializations, together with the Substitution Failure Is Not An Error (SFINAE) [10] principle of C++, allows us to encode type convertibility information in a matrix of boolean values that are known compile-time constants. These values then determine how the API may be used, as the semantic analyzer of the scripting language is generated by a metaprogram based on them. The parser of this language is not generated by a metaprogram in this solution; however, it is possible to generate parsers at compile time [6] [15].
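A minimal sketch of such a boolean convertibility matrix is shown below. It relies on the standard trait std::is_convertible instead of hand-written SFINAE, and the names convertibilityRow and ConvertibilityMatrix are assumptions made for this illustration rather than parts of the library.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

struct A {};
struct B : A {};
struct C {};

// Hypothetical helper: one row answering "does From convert to each of Tos?".
template <typename From, typename... Tos>
constexpr std::array<bool, sizeof...(Tos)> convertibilityRow() {
    return {{std::is_convertible<From, Tos>::value...}};
}

// Hypothetical matrix over a list of types. table()[i][j] is true when the
// i-th type converts implicitly to the j-th type.
template <typename... Ts>
struct ConvertibilityMatrix {
    static constexpr std::size_t size = sizeof...(Ts);

    static constexpr std::array<std::array<bool, size>, size> table() {
        // The inner Ts... expands first; the outer ... then produces one row
        // per type in the list.
        return {{convertibilityRow<Ts, Ts...>()...}};
    }
};
```

For the type list A*, B*, C*, the entry for (B*, A*) is true because a derived-to-base pointer conversion is implicit, while the entries for (A*, B*) and (C*, A*) are false.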
Our solution is able to cope with both functions and functors (objects behaving as functions), although creating instances of callable objects can be non-trivial. Functors are nevertheless advantageous for describing exposable APIs because they can be part of a class hierarchy, allowing us to take advantage of runtime polymorphism. A frequently used method to create EDSLs in C++ is to use the generalized command pattern [1] with functors. Such EDSLs can be easily exposed to an interpreter using our library. Some APIs require a relatively small amount of additional work. A potential source of problems is functions with several overloads. Those functions have to be disambiguated by static casts.
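The disambiguation by static cast can be illustrated with a short sketch. The overloaded function scale and the names scaleInt and scaleDouble are invented for this example; how the resulting pointer is then registered with the library is not shown here.

```cpp
#include <cassert>

// Two overloads: the bare name "scale" does not identify a single function,
// so neither decltype(scale) nor &scale is usable on its own.
int scale(int x) { return 2 * x; }
double scale(double x) { return 2.0 * x; }

// The target type of the cast selects exactly one overload.
constexpr auto scaleInt = static_cast<int (*)(int)>(&scale);
constexpr auto scaleDouble = static_cast<double (*)(double)>(&scale);
```

Each resulting pointer has a single, unambiguous type, so it can be inspected by type traits in the same way as a pointer to a non-overloaded function.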

Strings in metaprograms
The first challenge is one that may not be immediately obvious: string handling. This is non-trivial in metaprograms, primarily because (for technical reasons) a string literal is not, and cannot be, a valid template argument. The problem arises because the semantic analyzer works on function names, which are strings. A method for working effectively with compile-time strings was developed by Ábel Sinkovics [14]. We have slightly altered his approach to make it easier to create runtime strings from compile-time strings. Sinkovics' main idea was to use a macro, a metaprogram and a constexpr function to reduce a character string literal to a list of characters. His method generates some trailing zeros, which can easily be removed at compile time. He created a Boost MPL [11] vector from the characters, but we store the characters in a template parameter pack instead. This makes it possible to create a runtime string easily by utilizing the C++11 uniform initialization syntax [7]. The MetaStringImpl class is responsible for removing the trailing zeros, but its implementation is not relevant to this paper. The GetRuntimeStr method of the Accumulator class shows how easy it is to create runtime strings from this representation. The MetaString class exposes the trimmed string through a static method of the same name, whose body is return str::GetRuntimeString();.
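The character-pack representation can be tried out with a small self-contained sketch. It avoids Boost by writing the repetition out by hand for a fixed length of eight characters, and it trims the padding at runtime instead of at compile time as MetaStringImpl does. The names charAt, STR8, CharPack and Func1Name are assumptions made for this illustration.

```cpp
#include <cassert>
#include <string>

// Returns the i-th character of the literal, or '\0' past its end.
template <int N>
constexpr char charAt(const char (&s)[N], int i) {
    return i >= N ? '\0' : s[i];
}

// Hand-written stand-in for the preprocessor repetition: always yields
// eight characters, padding short literals with zeros.
#define STR8(s)                                             \
    charAt(s, 0), charAt(s, 1), charAt(s, 2), charAt(s, 3), \
    charAt(s, 4), charAt(s, 5), charAt(s, 6), charAt(s, 7)

template <char... cs>
struct CharPack {
    // Uniform initialization keeps every character, including the padding.
    static std::string padded() { return {cs...}; }

    // Construction from a C string stops at the first zero, which drops
    // the padding (done at runtime in this sketch).
    static std::string str() {
        const char buffer[] = {cs..., '\0'};
        return std::string(buffer);
    }
};

using Func1Name = CharPack<STR8("func1")>;
```

Because the name is now encoded in a type, it can travel through template metaprograms like any other type and still be turned back into a std::string when the generated lookup code needs it.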

Storing type information
The next obstacle is how to provide all the necessary information to the automaton (the semantic analyzer) in a convenient way. We use preprocessor macros to make it easy for the user to register functions; this is unavoidable, given the lack of introspection in C++. The information consists of the type signature of the callable object, the name of the function that identifies it, and the type of the object to be instantiated. We use tuples to store all the necessary type information, since template metaprograms primarily work with types. For functors it is easy to supply all of the information; for functions, however, it is important to box the function pointer into a functor type. The FunctionPointer class does exactly that, making it easy to work with function pointers in our metaprograms. The automaton is generated from a list of those tuples. If we want to instantiate the boxed function pointers, we face another issue: C++ functions cannot be overloaded on their return types. To solve this issue we added a common base class to all of the function pointer boxes. The instantiation function returns a pointer to the common base, which points to the derived box allocated on the heap.
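The boxing of function pointers behind a common base class can be sketched as follows. The names BoxBase, FunctionBox and instantiate, as well as the two sample functions, are hypothetical; in particular, the hand-written name comparison in instantiate stands in for code that the library would generate.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical common base: lets one factory function return any box.
struct BoxBase {
    virtual ~BoxBase() = default;
    virtual std::size_t arity() const = 0;
};

// Boxes a function pointer into a distinct functor type.
template <typename T, T* ptr>
struct FunctionBox;

template <typename Ret, typename... Args, Ret (*f)(Args...)>
struct FunctionBox<Ret(Args...), f> : BoxBase {
    Ret operator()(Args... args) const { return f(args...); }
    std::size_t arity() const override { return sizeof...(Args); }
};

// Sample API, invented for this sketch.
int twice(int x) { return 2 * x; }
int add(int a, int b) { return a + b; }

using TwiceBox = FunctionBox<decltype(twice), &twice>;
using AddBox = FunctionBox<decltype(add), &add>;

// A single return type is enough because every box derives from BoxBase.
std::unique_ptr<BoxBase> instantiate(const std::string& name) {
    if (name == "twice") return std::unique_ptr<BoxBase>(new TwiceBox());
    if (name == "add") return std::unique_ptr<BoxBase>(new AddBox());
    return nullptr;  // unknown name
}
```

Without the base class, instantiate would need a different return type for each box, which is exactly the overloading on return types that C++ does not allow.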

Generating the analyzer
What the automaton actually does is generate several matrices: one for each possible argument position. Currently, the number of matrices is a pre-configured constant equal to the maximum arity (number of parameters) supported by the automaton; deducing this number from the input is possible and is a target for future development.
Each matrix is indexed by function names. The nodes of the syntax tree that we want to analyze represent function compositions such as A(..., B(...), ...), where the result of B is passed as the ith parameter of function A. Let us call the ith matrix M_i. The node of the syntax tree representing the call above is valid if and only if the value of M_i[A, B] is true. In other words, M_i[A, B] records whether the return type of B is implicitly convertible to the type of the ith formal parameter of A. To make this work, the metaprogram has to generate all the matrices at compile time, together with the code that looks up values in them. At present the lookup involves a function call chain of O(N^2) length, where N is the number of functions in the API, which can result in a stack overflow even for moderately sized APIs. It is possible, however, to reduce this to a call chain of O(1) length if we use the information to build a hash map instead of generating a recursive function chain.
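The hash-based lookup suggested above could look like the following sketch. The class CompositionTable and the function buildExampleTable are assumptions for illustration; the entries are registered by hand here, whereas the library would derive them from the registered signatures. The types and function names follow the small example API used in this paper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <unordered_set>

struct A {};
struct B : A {};
struct C {};
struct E : B {};

// Hypothetical runtime form of the matrices M_i: a hash set holding one
// key per valid (outer function, parameter index, inner function) triple.
class CompositionTable {
public:
    // Stores the triple only if Ret converts implicitly to Param.
    template <typename Ret, typename Param>
    void record(const std::string& outer, std::size_t index,
                const std::string& inner) {
        if (std::is_convertible<Ret, Param>::value)
            valid_.insert(key(outer, index, inner));
    }

    // Average-case constant-time replacement for the recursive call chain.
    bool isValid(const std::string& outer, std::size_t index,
                 const std::string& inner) const {
        return valid_.count(key(outer, index, inner)) != 0;
    }

private:
    static std::string key(const std::string& outer, std::size_t index,
                           const std::string& inner) {
        return outer + '\n' + std::to_string(index) + '\n' + inner;
    }
    std::unordered_set<std::string> valid_;
};

// Hand-filled for func1(A*) and the second parameter of func2(A*, B*).
CompositionTable buildExampleTable() {
    CompositionTable t;
    t.record<A*, A*>("func1", 0, "func1");
    t.record<B*, A*>("func1", 0, "func2");
    t.record<C*, A*>("func1", 0, "func3");
    t.record<E*, A*>("func1", 0, "D");
    t.record<A*, B*>("func2", 1, "func1");
    t.record<B*, B*>("func2", 1, "func2");
    t.record<C*, B*>("func2", 1, "func3");
    t.record<E*, B*>("func2", 1, "D");
    return t;
}
```

The table is filled once at start-up and every later validity check is a single hash lookup, independent of the number of functions in the API.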
To show how easy it is to work with this metaprogram, let us study a trivial example. Suppose we have a tiny API with a dummy implementation, consisting of the functions func1, func2 and func3 and the functor D. Now, if we want to know, for example, which functions can appear as the first argument of func1, we can use the GetComposables function of the generated automaton and compare its result with the expected set of names using EXPECT_EQ(expected, result). Here, because converting a pointer to a derived class into a pointer to a base class is a valid implicit conversion, all functions returning a pointer to A or to a descendant of A are valid. Those are func1, func2 and the functor D in this example.
The template parameters of Automata are in fact tuples of three components describing a function: the type signature of the function, the object to instantiate, and the name of the function. Automata checks every possible composition, stores whether it is valid for each parameter position, and generates the lookup code. Unfortunately, due to the verbosity and noisiness of metaprogramming, the implementation of the lookup code generation is more than 200 lines long. The lookup code operates solely on runtime strings. The current implementation involves a relatively time-consuming lookup, but in the future it will be possible to generate code that initializes a hash table for lookup instead of generating the lookup code directly. This improvement should significantly improve the performance of the interpreted scripting languages.

CASE STUDY
We developed a tool that executes queries on a C++ codebase. The tool is based on an EDSL that is already available in Clang.
Clang [4] is a modern C++ compiler consisting of a set of libraries, each responsible for a separate task of the compiler. It was designed so that tools can be built on top of its libraries. One of these libraries provides an embedded domain-specific language [2] [3] [5] called AST matchers for expressing patterns in the abstract syntax tree of the subject code. It exposes a number of predicate functions (and functors) that can be composed to form a pattern.
C++ is one of the most complex languages in existence, and an AST built from C++ source code naturally has to reflect this complexity. This poses a problem for tool developers who want to use the matcher library: forming correct patterns is far from trivial. Developers often have to adopt a trial-and-error workflow as they try to piece together a correct AST matcher expression, because there is no easy way of testing a matcher short of recompiling the tool, re-parsing the subject code, and running the matchers against the resulting AST. Overall, this makes for an extremely inefficient workflow.
To solve this, we are building a Read Eval Print Loop (REPL) interface that provides instant feedback [12]. The underlying engine is responsible for parsing the provided matcher expression into a primitive AST and translating it into a matcher object. It is critical to ensure semantic correctness. It is also imperative that adding support for new AST matchers be very easy, as Clang is developing rapidly.
The tool we developed relies on an interpreted query language to run queries against C++ code bases, or more specifically against their Abstract Syntax Tree (AST). The general design of the tool is shown in Figure 2. The input of the tool is an AST matcher expression given as a query string, that is, an expression describing a pattern to be searched for in the target AST, much like regular expressions describe patterns over strings (text). The front-end is responsible for parsing the query, resulting in a syntax tree of the query expression itself. This task is fairly trivial, as matcher expressions consist only of function calls and function composition, besides constant literals. Any lexical or syntactical errors are reported during parsing. The next stage is semantic analysis: type-checking the function compositions. The verifier is generated by a metaprogram, based on the type information available to the compiler. It detects type mismatches while building the compiled matcher expression from the query's syntax tree. The resulting matcher expression is executed by the execution module, which matches the pattern defined by the expression against the AST of the target C++ code base.
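The front-end stage described above can be sketched with a small recursive-descent parser. It accepts only nested function calls and omits the constant literals that real matcher expressions may contain; the names Node, Parser and parses are assumptions for this illustration and do not come from the tool itself.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Syntax tree node of a query: a function name and its argument calls.
struct Node {
    std::string name;
    std::vector<Node> args;
};

class Parser {
public:
    explicit Parser(const std::string& text) : text_(text), pos_(0) {}

    // Parses the whole input as a single call expression.
    Node parse() {
        Node root = parseCall();
        skipSpaces();
        if (pos_ != text_.size()) throw std::runtime_error("trailing input");
        return root;
    }

private:
    void skipSpaces() {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_])))
            ++pos_;
    }

    // Consumes c if it is the next non-space character.
    bool consume(char c) {
        skipSpaces();
        if (pos_ < text_.size() && text_[pos_] == c) { ++pos_; return true; }
        return false;
    }

    // call := name '(' [ call { ',' call } ] ')'
    Node parseCall() {
        skipSpaces();
        Node node;
        while (pos_ < text_.size() &&
               (std::isalnum(static_cast<unsigned char>(text_[pos_])) ||
                text_[pos_] == '_'))
            node.name += text_[pos_++];
        if (node.name.empty()) throw std::runtime_error("expected a name");
        if (!consume('(')) throw std::runtime_error("expected '('");
        if (consume(')')) return node;
        do {
            node.args.push_back(parseCall());
        } while (consume(','));
        if (!consume(')')) throw std::runtime_error("expected ')'");
        return node;
    }

    std::string text_;
    std::size_t pos_;
};

// Convenience wrapper: reports whether the text is syntactically valid.
bool parses(const std::string& text) {
    try { Parser(text).parse(); return true; }
    catch (const std::runtime_error&) { return false; }
}
```

The resulting tree has exactly the shape A(..., B(...), ...) that the generated semantic analyzer checks, so each parent and child pair can be validated against the convertibility information before any matcher object is built.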
The module responsible for semantic analysis can also be used to provide auto-completion, helping users to write matcher expressions by listing all the type-safe subexpressions that are possible at the point being edited. The topic of the present paper is the design and implementation of the metaprogram that generates this module.
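Auto-completion on top of the composability information can be sketched as a prefix filter over the names that are valid in a given argument position. The types Slot and ComposableTable and the functions complete and exampleTable are hypothetical; the table is filled by hand with the result stated earlier for the first argument of func1.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// (outer function name, parameter index) -> names valid in that position.
using Slot = std::pair<std::string, std::size_t>;
using ComposableTable = std::map<Slot, std::set<std::string>>;

// Lists the type-safe candidates for a slot that start with the given prefix.
std::vector<std::string> complete(const ComposableTable& table,
                                  const std::string& outer, std::size_t index,
                                  const std::string& prefix) {
    std::vector<std::string> candidates;
    auto it = table.find(Slot(outer, index));
    if (it == table.end()) return candidates;  // unknown slot: no suggestions
    for (const std::string& name : it->second)
        if (name.compare(0, prefix.size(), prefix) == 0)
            candidates.push_back(name);
    return candidates;
}

// Hand-written stand-in for what the generated automaton would report.
ComposableTable exampleTable() {
    ComposableTable table;
    table[Slot("func1", 0)] = std::set<std::string>{"func1", "func2", "D"};
    return table;
}
```

Because only names that pass the type check are stored, every suggestion offered to the user leads to a well-typed expression.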

RELATED WORK
There are several alternatives, such as the Simplified Wrapper and Interface Generator (SWIG) [16], an external tool that can generate such glue code for many existing scripting and compiled languages. However, the supported scripting languages did not meet our requirements. One problem is that SWIG does not support statically typed scripting languages. The other is that it does not handle EDSLs written in C++ well.
Comprehending DSLs is not always an easy task [12]. However, if a REPL is available for experimentation, it becomes much easier for new developers to explore and understand the DSL.
Using solely C++ to generate bindings for an interpreter is not a new idea. The Boost.Python [18] library takes this approach. However, it is limited to Python, which was not suitable for our purposes.

SUMMARY
Generic programming is a very popular paradigm for solving various tasks. There are also some automated tools for generating bindings to a language. However, as it turned out, the C++ programming language is capable of generating type-safe bindings to a language without any external tools. This shows the power of compile-time computation and the richness of the information provided by static typing. We created a library that makes it straightforward for users to expose an EDSL or an API to a scripting language and that significantly reduces the maintenance cost of the resulting bindings. The upcoming C++17 standard is expected to introduce compile-time reflection to the language. This would help us to write significantly more efficient code and greatly reduce the size of our code base. Moreover, it would open up new possibilities, such as automatically exposing whole classes to an object-oriented type-safe scripting language. Several tasks remain before we reach a completely robust solution, but the results so far look promising.

Fig. 1 General method

```cpp
#include <set>
#include <string>
#include <tuple>
#include <boost/preprocessor/repetition/repeat.hpp>

// Expands a string literal into STRING_MAX_LENGTH characters.
// STRING_MAX_LENGTH is a configuration constant defined elsewhere.
// As printed, every generated element is followed by a comma.
#define DO(z, n, s) at(s, n),
#define S(s) \
    BOOST_PP_REPEAT(STRING_MAX_LENGTH, \
                    DO, s)

// Returns the i-th character of s, or '\0' past the end of the literal.
template <int N>
constexpr char at(char const (&s)[N], int i) {
    return i >= N ? '\0' : s[i];
}

// Stores the characters in a template parameter pack.
template <char... cs>
struct Accumulator {
    static std::string GetRuntimeStr() { return {cs...}; }
};

// Removes the trailing zeros; the implementation is omitted in the paper.
template <typename T, char...>
struct MetaStringImpl;

template <char... cs>
struct MetaString {
    typedef typename MetaStringImpl<Accumulator<>, cs...>::result str;
    static std::string GetRuntimeStr() { return str::GetRuntimeString(); }
};
```

```cpp
// Boxes a function pointer into a functor type.
template <typename T, T* ptr>
struct FunctionPointer;

template <typename Ret, typename... Args, Ret (*f)(Args...)>
struct FunctionPointer<Ret(Args...), f> {
    Ret operator()(Args... args) { return f(args...); }
};

// Each macro yields a tuple of: type signature, object to instantiate, name.
#define FUNCTION(x) \
    std::tuple<decltype(x), \
               FunctionPointer<decltype(x), &x>, \
               MetaString<S(#x)>>

#define FUNCTOR(x) \
    std::tuple< \
        decltype(&x::operator()), x, \
        MetaString<S(#x)>>
```

```cpp
// The tiny example API with a dummy implementation.
struct A {};
struct B : A {};
struct C {};
struct E : B {};

A* func1(A*) { return 0; }
B* func2(A*, B*) { return 0; }
C* func3(A*) { return 0; }

struct D {
    E* operator()(A*) { return 0; }
};
```

To create the automaton we only need to enumerate the functions of our API with the corresponding macros:

```cpp
typedef typename Automata<
    FUNCTION(func1),
    FUNCTION(func2),
    FUNCTION(func3),
    FUNCTOR(D)
>::result GA;
```

```cpp
// Which functions may appear as the first argument of func1?
std::set<std::string> expected{"func1", "func2", "D"};
auto tmp = GA::GetComposables("func1", 0);
std::set<std::string> result(tmp.begin(), tmp.end());
```

ISSN 1335-8243 (print) © 2013 FEI TUKE, ISSN 1338-3957 (online), www.aei.tuke.sk

Fig. 2 Design of the Query Program