MetaR: simple, high-level languages for data analysis with the R ecosystem

Data analysis tools have become essential to the study of biology. Here, we applied language workbench technology (LWT) to create data analysis languages tailored for biologists with a diverse range of experience: from beginners with no programming experience to expert bioinformaticians and statisticians. A key novelty of our approach is its ability to blend user interfaces with scripting in a single platform. This feature helps beginners and experts alike analyze data more productively. This new approach has several advantages over state-of-the-art approaches currently popular for data analysis: experts can design simplified data analysis languages that require no programming experience and behave like graphical user interfaces, yet have the advantages of scripting. We report on such a simple language, called MetaR, which we have used to teach complete beginners how to call differentially expressed genes and build heatmaps. We found that beginners can complete this task in less than 2 hours with MetaR, whereas more traditional teaching with R and its packages would require several training sessions (6-24 hours). Furthermore, MetaR seamlessly integrates with Docker to enable reproducibility of analyses and simplified R package installations during training sessions. We used the same approach to develop the first composable R language. A composable language is a language that can be extended with micro-languages. We illustrate this capability with a Biomart micro-language designed to compose with R and help R programmers query Biomart interactively to assemble specific queries to retrieve data. (The same micro-language also composes with MetaR to help beginners query Biomart.) Our teaching experience suggests that language design with LWT can be a compelling approach for developing intelligent data analysis tools and can accelerate training for common data analysis tasks.
LWT offers an interactive environment with the potential to promote exchanges between beginner and expert data analysts.


INTRODUCTION
presents an overview of the features offered by MetaR and Composable R, for the full range of users that the platform supports, from beginner to expert.

[Figure axis: level of computational and programming experience of a user, from Beginner through Intermediate to Expert.]
We present the capabilities of the MetaR platform organized by the level of experience of a user. Beginners mostly benefit from the ability to blend graphical interfaces with scripting, and from high-level languages developed by experts on the platform. Intermediate users, who have basic programming skills, are able to customize the languages in simple ways, such as by creating intentions to help with repetitive steps of analyses. Intentions are context-dependent actions that can be added to a language at runtime (see Simi and Campagne [2014] for illustrations). Experts are users with strong programming skills who have become familiar with LWT. They can create micro-languages to extend Composable R, or design entirely new data analysis languages to help beginners with analysis for new domains. Users at all levels benefit from LWT platform features, including seamless integration of the languages with version control (see Benson and Campagne [2015] for a discussion of the integration with version control).

Teaching the MetaR data analysis language
Teaching a data analysis tool can smooth the learning curve and prevent unnecessary frustration that students could experience if they tried working with the tool on their own. Since MetaR is a new platform for data analysis, training is also important to help users get started with the software.

language (such as Java, or C) but this would require implementing all aspects of data manipulation in the target language. Since the R language (Ihaka and Gentleman [1996]) is widely used for data analysis in biology, we considered using it as a runtime system. Expert biostatisticians and bioinformaticians have developed many R packages that implement advanced analysis for biological high-throughput data. These packages can be used to simplify the implementation of a runtime system for a new data analysis language. We therefore decided that the MetaR language would generate R code in order to take advantage of the packages developed in this language. This decision greatly simplified the implementation of the MetaR language because it removed the need to develop a custom language runtime system.

can be annotated with one or more Column Groups.

Users can define arbitrary Column Groups in a different node called "Column Groups and Usages" (shown on the right of Figure 2). If two columns are related, users can define a Group Usage to explicitly

which columns contain read counts, the ID column group, which uniquely identifies specific rows of the data table, and the heatmap column group, used to choose which column groups should be shown in the heatmap.

This illustrates that the table annotation mechanism is flexible and can be leveraged by specific statements of the language, in order to indicate that the statement needs data annotated in a certain way.
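The annotation mechanism described above can be pictured with a small model. The following is a minimal sketch in Python, not the MetaR implementation: the `Column`, `Table`, and `check_requirements` names, the column names, and the group names other than ID, counts, and heatmap are all hypothetical. It shows how a statement could declare the column groups it needs and how a table could be checked against that declaration.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """A table column annotated with zero or more column group names."""
    name: str
    groups: set = field(default_factory=set)


@dataclass
class Table:
    """A table described only by its annotated columns (no data loaded)."""
    name: str
    columns: list

    def columns_in(self, group):
        """Return the names of the columns annotated with `group`."""
        return [c.name for c in self.columns if group in c.groups]


def check_requirements(table, required_groups):
    """Return the required groups that no column of `table` carries."""
    return [g for g in required_groups if not table.columns_in(g)]


# Hypothetical table, loosely modeled on the description in the text.
table = Table("counts.tsv", [
    Column("gene", {"ID"}),
    Column("A_LPS", {"counts", "LPS", "heatmap"}),
    Column("B_LPS", {"counts", "LPS", "heatmap"}),
    Column("A_CTRL", {"counts", "Control", "heatmap"}),
])

# A statement that needs ID, counts and heatmap annotations.
missing = check_requirements(table, ["ID", "counts", "heatmap"])
id_columns = table.columns_in("ID")
count_columns = table.columns_in("counts")
```

In this sketch a statement only names the groups it needs, so the same statement can work with any table whose columns carry those annotations.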

Analyses make it possible for users to express how data is to be analyzed. Figure

• Auto-completion offers a convenient way to set references between objects. Accepting an auto-completion suggestion helps users avoid typos.

• Some users choose not to use auto-completion to set references and instead type a referenced node name. In this case, mis-typed names that cannot be resolved to a valid node are highlighted in red

Figure 3. MetaR Analysis. The Analysis node is composed of a list of statements. This analysis works with the table of data presented in Figure 2, removes the row of data where the value Total appears in the gene column, performs statistical modeling with Limma Voom to identify genes differentially expressed between LPS-treated and control samples, constructs a heatmap and displays the plot as a preview. Finally, the analysis converts the plot to PDF format and writes the joined table (statistics and counts) to the results.tsv file.

Auto-completion help is available for the various types of references supported by the MetaR language. Examples of these can be seen in Figure 3 for tables (whose names are in green), plots (whose names are in blue), styles (names shown with a green background and white foreground, such as HeatmapStyle), or Column Group names (shown with a blue-grey background and black foreground). Pressing control-B (or command-B on Mac) with the cursor on these nodes navigates to the destination of the reference (a menu is also available to help novice users discover this navigation mechanism). References may point to child nodes defined inside an analysis (e.g., plots), or nodes defined outside the analysis (e.g., tables and column groups).
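The two ways of setting a reference discussed above (accepting a completion, or typing a name that may fail to resolve) can be illustrated with a toy scope. This is a sketch under stated assumptions, not how MPS implements references: the `resolve`, `complete`, and `unresolved` functions and the contents of `scope` are hypothetical.

```python
def resolve(name, scope):
    """Return the kind of node registered under `name`, or None if unresolved."""
    return scope.get(name)


def complete(prefix, scope):
    """Suggest every node name in scope that starts with `prefix`."""
    return sorted(n for n in scope if n.startswith(prefix))


def unresolved(references, scope):
    """List the typed names that do not resolve to a node in scope."""
    return [r for r in references if resolve(r, scope) is None]


# Hypothetical scope: node name -> kind of node.
scope = {
    "Results": "table",
    "heatmap": "plot",
    "HeatmapStyle": "style",
    "counts": "column group",
}

suggestions = complete("Heat", scope)
flagged = unresolved(["Results", "Resutls"], scope)
```

Names returned by `unresolved` correspond to the references an editor would highlight as errors, while names picked from `complete` resolve by construction.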

Importantly, the MetaR user interface can also display buttons and images directly as part of the

and many readers may therefore be unfamiliar with this technique. We will use an example to explain the advantage of this technique for data analysis.

Consider the table of results produced by the analysis shown in Figure 3. Users are likely to need to annotate the subset of genes found differentially expressed with gene names and gene descriptions. Information such as this is available in the Biomart system (Haider et al. [2009]).

To illustrate language composition, we created a new kind of MetaR statement called query biomart, which we defined in a micro-language. A micro-language is a language which provides only a few concepts meant to extend a host language. In this case, the MetaR language is the host language and query biomart is a concept contributed by the micro-language. The purpose of this concept is to connect to Biomart and retrieve data. In the R language, this functionality is provided as a

Figure 4. Example of Micro-language Composition. The query biomart statement is defined in a micro-language called org.campagnelab.metar.biomart, which extends the host language org.campagnelab.metar.tables. The biomart language provides one statement that offers an interactive user interface to help users retrieve data from Biomart. This language reuses expressions and tables from the host language. Micro-languages can be enabled or disabled dynamically by the end-user at the level of a model. This example retrieves Human ENSEMBL identifiers and gene descriptions using the HGNC gene symbols used as identifiers in the Results table (see Figure 3 for the analysis that produced Results).

the host language. We demonstrate this capability by adapting the query biomart statement shown in Figure 4 to the R language. Adaptation is simple because both MetaR and R generate code in the same target language (R). In this case, we create a sub-concept of Expr (this type represents any R expression), and define a field of type Biomart (the concept that implements query biomart). This simple adapter is sufficient to make it possible to use the query biomart user interface inside an R script and is defined in the language org.campagnelab.metar.biomartToR. The result of composing the adapter language with composable R is shown in Figure 6. We also provide a short video to illustrate the interactive capabilities of a micro-language combined with composable R (see https://youtu.be/ZwGj1RPOODQ).
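The adapter described above follows a familiar object-oriented shape: a sub-concept of the expression type that holds the statement in a field and delegates code generation to it. The sketch below mimics that shape in Python. It is an illustration only: MPS concepts are not Python classes, and the `RawR`, `BiomartExpr`, and `generate` names, the constructor parameters, and the exact R text emitted (including the use of the biomaRt functions `useMart` and `getBM`) are assumptions rather than the generator shipped with MetaR.

```python
class Expr:
    """Base concept: anything that can stand where R expects an expression."""

    def to_r(self):
        raise NotImplementedError


class RawR(Expr):
    """Regular R code typed by the user (hypothetical helper)."""

    def __init__(self, code):
        self.code = code

    def to_r(self):
        return self.code


class Biomart:
    """Stand-in for the query biomart statement; it is not an Expr itself."""

    def __init__(self, dataset, attributes, result_name):
        self.dataset = dataset
        self.attributes = attributes
        self.result_name = result_name

    def to_r(self):
        # Assumed shape of the generated R code, for illustration only.
        attrs = ", ".join(f'"{a}"' for a in self.attributes)
        return (f"{self.result_name} <- getBM(attributes = c({attrs}), "
                f'mart = useMart("ensembl", dataset = "{self.dataset}"))')


class BiomartExpr(Expr):
    """Adapter: a sub-concept of Expr with one field of type Biomart."""

    def __init__(self, query):
        self.query = query

    def to_r(self):
        # Both languages target R, so the adapter simply delegates.
        return self.query.to_r()


def generate(script):
    """Concatenate the R code generated by each expression of a script."""
    return "\n".join(e.to_r() for e in script)


query = Biomart("hsapiens_gene_ensembl",
                ["ensembl_gene_id", "description"],
                "resultFromBioMart")
script = [RawR("library(biomaRt)"), BiomartExpr(query), RawR("head(resultFromBioMart)")]
r_code = generate(script)
```

Because the adapter only delegates, the statement keeps a single implementation and needs no host-specific code generation of its own.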

This example illustrates that a composable R language makes it possible to mix regular R code with new types of language constructs that can include user interface elements. This opens up new possibilities to facilitate repetitive analyses in R, for instance for specific data science domains (e.g., the Biomart example is useful for bioinformatic data analyses), but also for more general activities where simpler ways to perform a task would be beneficial. An example of this would be a micro-language to facilitate the use of packages to replace the boilerplate package import code found at the beginning of most R scripts.
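To make the package import example concrete, the sketch below shows the kind of R boilerplate that such a micro-language could generate from a plain list of package names. It is written in Python for illustration. The `import_packages` function is hypothetical, and the generated pattern (`require`, then `install.packages` and `library` when the package is missing) is one common idiom, assumed here rather than taken from the paper.

```python
def import_packages(packages):
    """Generate R code that loads each package, installing it if missing."""
    lines = []
    for pkg in packages:
        lines.append(f'if (!require("{pkg}")) {{')
        lines.append(f'  install.packages("{pkg}")')
        lines.append(f'  library("{pkg}")')
        lines.append("}")
    return "\n".join(lines)


# Package names chosen for illustration only.
boilerplate = import_packages(["data.table", "ggplot2"])
```

A single declarative statement listing the packages would then stand in for four lines of R per package.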

Figure 6. Composing Query Biomart with the composable R language. We developed an adapter that makes it possible to use the MetaR query biomart statement directly inside a composable R Script. This figure shows how the query biomart Expression adapter appears when used inside an R script. Notice how the table and column adapters are used inside a regular hist() function call (resultFromBioMart$percent identity from aflavus homologs). These adapters make it possible to refer to the table produced by the statement as an R expression and provide auto-completion for column names in the table (determined dynamically based on the query expressed in the query biomart statement).

Figure 7 illustrates that language composition can also be used to embed R expressions inside a MetaR analysis. This extension is possible because both analyses and R expressions generate code compatible with the syntax of the R programming language. Providing a way to embed the full language in a simpler analysis language offers a guarantee that the end-user will not be overly limited by restrictions of the simpler language.

Figure 7. Composing R Expressions with the MetaR Language. Top panel: this example illustrates that it is possible to use R code inside a MetaR analysis. In this snapshot, R code is delimited by the -R and R- markers and shown with a blue background. Embedding R code in MetaR provides flexibility to perform operations for which MetaR statements have not yet been developed. The analysis shown simulates a dataset using simple parameters and tests the ability of Limma Voom, as integrated with MetaR, to call differentially expressed genes. Bottom panel: shows the result of executing the analysis inside the MPS LW. As part of execution, the analysis is converted to R code, this code is run, and standard output is displayed inside the LW. The STATEMENT EXECUTED// lines hyperlink the progress of the execution with each specific analysis statement that has been executed.

are relatively easy to learn), more advanced users who need to perform similar analyses across several datasets tend to strongly prefer analysis software that does not require repeating interactions with a GUI for every new dataset that must be studied. The novel approaches we have used to develop MetaR share these advantages with GUIs.

A minority of analysis software with GUIs also supports writing and running scripts in their user interface. For instance, JMP from SAS Inc. is an example of statistical analysis software with a GUI that also offers a scripting language. However, when scripting is offered, it is often only loosely integrated with the rest of the interface. Furthermore, users who are familiar with the GUI often need to learn scripting from scratch and do not benefit much from their prior experience using the GUI.

If the language is sufficiently general, then novice users may learn skills that they can reuse when learning other general data analysis languages. If the language is too limited, then novice users would only learn a specialized analysis tool similar to existing GUI analysis tools. Rigorously determining to which category the MetaR language belongs would require following users for several months or years while they use the tool, and we have not done such a study. However, we think that MetaR can help users transition to more general languages for the following reasons.

First, users who learn the high-level MetaR language acquire basic skills that are similar to those

representation of the source code into an abstract syntax tree (AST), a data structure used when analyzing and transforming programming languages into machine code.

In the MPS LW, the AST is also a central data structure, but the parsing elements of the compilers are

The choice of serialization rather than encoding with text has a profound consequence. Serialization

Benson and Campagne [2015]. In this manuscript, we extensively use language composition to extend the R language and provide the ability to embed user interfaces into R programs.

Abstract Syntax Tree (AST)
An AST is a data structure traditionally used by compilers as a step towards generating machine code. In the MPS Language Workbench, an AST is a tree data structure, where nodes of the tree are instances of concepts (in the object-oriented sense).

Each concept is connected to other concepts with an open-ended arrow to indicate inheritance (e.g., A <- B indicates that B is a sub-concept of A). Green boxes indicate fields of a concept and are connected to the concept that has these fields by a line with a black diamond on the concept that owns the field. This shows that BinaryExpression is a concept that is an Expression and has three fields: left, operator and right. Dotted lines connect nodes to their concept. For instance, the <- and + nodes are instances of BinaryExpression.

node that represents a FutureTable (created when running the R script generated from the MetaR Analysis), and the viewer is opened, it tries to load the data file that the analysis would create for this table. If the file is found, the content is displayed using a Java Swing Component in the MPS user interface of the Table Viewer tool.
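The concept hierarchy described in the AST figure caption above (BinaryExpression is an Expression with the fields left, operator and right) can be sketched with ordinary classes. This is a simplified Python analogy, not MPS code: the `Identifier` concept and the `render` function are hypothetical, and the operator is stored as a plain string for brevity.

```python
from dataclasses import dataclass


@dataclass
class Expression:
    """Root concept: every node in this small tree is an Expression."""


@dataclass
class Identifier(Expression):
    """Leaf concept holding a variable name (hypothetical)."""
    name: str


@dataclass
class BinaryExpression(Expression):
    """Sub-concept of Expression with three fields."""
    left: Expression
    operator: str
    right: Expression


def render(node):
    """Turn a tree of concept instances back into source text."""
    if isinstance(node, Identifier):
        return node.name
    if isinstance(node, BinaryExpression):
        return f"{render(node.left)} {node.operator} {render(node.right)}"
    raise TypeError(f"unsupported node: {node!r}")


# The tree for `x <- a + b`: the <- and + nodes are both BinaryExpression instances.
tree = BinaryExpression(
    Identifier("x"),
    "<-",
    BinaryExpression(Identifier("a"), "+", Identifier("b")),
)
source = render(tree)
```

Here the tree is built directly rather than parsed from text, and the text form is derived from the tree by `render`.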

In order to facilitate reproducible execution, we implemented optional execution within a Docker container.

A Docker image was created to contain a Linux operating system and a recent distribution of the R language (provided in the rocker-base image), as well as several R packages needed when executing the MetaR statements. The Run Configuration was modified to enable execution inside a Docker container when the user selects a checkbox "execute inside docker container". Information necessary to run with Docker (i.e.,