Sustainable, Extensible Documentation Generation using inlinedocs

The concept of structured, interwoven code and documentation has existed for many years, but existing systems that implement this for the R programming language do not tightly integrate with R code, leading to several drawbacks. This article attempts to address these issues and presents two contributions for documentation generation for the R community. First, we propose a new syntax for inline documentation of R code within comments adjacent to the relevant code, which allows for highly readable and maintainable code and documentation. Second, we propose an extensible system for parsing these comments, which allows the syntax to be easily augmented.


Introduction
The standard way to distribute R code is in a package along with Rd files that document the code. There are several existing methods for documenting a package by writing R comments, which are later processed and converted into standard Rd files. We first review these efforts, emphasizing the key issues that justify the introduction of a new package like inlinedocs.

Existing documentation generation systems for R
For report generation and literate programming, the mature Sweave (Leisch 2003) format allows integration of R code and results within LaTeX documents (Lamport 1986). However, the goal of inlinedocs is different. It aims for integration of documentation inside of R code files, to generate Rd documentation files using R code and markup in R comments. Thus for inlinedocs we need to extract the documentation specified in R code, and the Sweave system cannot be easily applied to this parsing task.
The package.skeleton function that ships with base R is intended to ease the generation of Rd files from R code (R Development Core Team 2012). After specifying some input R code files or objects to use for the package, it produces some minimal documentation that must be completed using a text editor. Although package.skeleton is sufficient for creating small packages that are published once and forgotten, it offers little help for continued maintenance of packages for which Rd files are frequently updated.
The other existing approaches, Rdoc and Roxygen, attempt to address this sustainability problem using Rd generation from comments in R code (Bengtsson 2010; Danenberg 2009). The documentation is thus written closer to the code it documents, which is easier to maintain. These packages are a step toward seamless integration of code and documentation, but they have three major drawbacks:
1. They only use comments to generate documentation, ignoring the information already defined in the code. This is particularly problematic for documenting function arguments, which requires the repetition of the argument names in the function definition and the documentation. This repetition is a possible source of disagreement between code and documentation if both are not simultaneously updated.
2. The documentation for an object appears in comments above its definition. These comment blocks can grow to be quite large, and thus they tend to be far away from the relevant code.
3. Examples are defined either in comments or in supplementary R code files. Examples in comments are not easy to test and debug with the R interpreter, and supplementary R code files reintroduce the separation of code and documentation that these tools are supposed to eliminate.
There are many tools that accomplish documentation alongside code in other programming languages. Notable examples include docstrings in Lisp and Python, Javadoc for Java, and Doxygen, which supports several languages (Wikipedia 2012). These systems use large comments in headers, and do not support R. In contrast, inlinedocs is designed for R packages, uses smaller comments alongside the code, and exploits the code structure to reduce the need to repeat information in the documentation.

Documentation using inline comments
The inlinedocs package addresses the aforementioned issues by proposing a new syntax for inline documentation of R packages. Using inlinedocs, one writes documentation in comments right next to the relevant code, and examples in the ex attribute of the relevant object. By design, inlinedocs exploits the structure of the R code so that only minimal documentation comments are required, reducing duplication and simplifying code maintenance.
The remainder of the article is organized as follows. In Section 2, we discuss the details of the inlinedocs syntax for writing documentation in R comments. In Section 3, we discuss the design and implementation of inlinedocs, and explain how the syntax can be extended. In Section 4, we conclude and offer some ideas for future improvements. Finally, in Appendix A, we show a concrete application by porting the base apply function to inlinedocs.

The inlinedocs syntax for inline documentation of R packages
The main idea of inlinedocs is to document an R object using ### and ##<< comments directly adjacent to its source code. Furthermore, inlinedocs allows documentation wherever it is most relevant in the code using ##section<< comments. These special comment strings are designed to work well with the default behavior of common editing environments, such as Emacs with the Emacs Speaks Statistics (Rossini, Heiberger, Sparapani, Maechler, and Hornik 2004) add-on package:
• ### is aligned to the left margin, providing maximum space for comment text.
• ##<< is aligned with the start of adjacent code lines, so that comments using this form in the middle of a function do not obscure the code structure.
The following sections illustrate common usage of inlinedocs comments through fermat, an example package inspired by the Roxygen vignette (Danenberg 2009). The examples were processed and checked for validity using inlinedocs version 1.9. For brevity, only the most frequently used inlinedocs features will be discussed, and the reader is directed to the inlinedocs web site for complete documentation: http://inlinedocs.r-forge.r-project.org/

Documenting function arguments and return values
The following example demonstrates the minimal documentation a package author should provide for every function. Note that the location of white space, brackets, default arguments and commas is quite flexible.

    fermat.test <- function
    ### Test an integer for primality using Fermat's little theorem.
    (n ##<< The integer to test.
     ){
      a <- floor(runif(1,min=1,max=n))
      a^n %% n == a
    ### Whether the integer passes the Fermat test for a randomized
    ### \eqn{0<a<n}.
    }

The comments correspond to the following sections of the fermat.test.Rd file:
• ### comments following the line of function form the description section.
• For each argument, an item is created in the arguments section using a ##<< comment on the same line.
• ### comments at the end of the function form the value section.
By default, name, alias and title Rd sections are set to the function name, so this minimal level of documentation is enough to make a working package that passes R CMD check with no errors or warnings.

Inline titles, arguments, and other sections
The following example shows some optional inlinedocs comments that allow detailed and flexible specification of Rd files. On the first line, the # comment specifies the title. On the lines after an argument, ### comments specify its documentation. This is a useful alternative to inline ##<< comments for longer, multi-line documentation of function arguments.
A ##section<< comment can be used anywhere within a function, for any documentation section except examples, which are handled in a special manner as shown below in Section 2.3. In each comment, arbitrary Rd may be written, as shown in the ##seealso<< section above. Each ##section<< may occur several times in the documentation for a single object. Such multiple occurrences are normally concatenated as separate paragraphs, but special processing is applied to match the intended use of the following documentation sections:
• title sections are concatenated into a single line.
• description sections should be brief, so are concatenated into a single paragraph.
• alias contents are split to give one alias per line of text.
• keyword contents are split at white space, each generating a separate \keyword entry.
The ### and ##<< documentation styles may be freely mixed. In general, ### or # lines are processed first, followed by any corresponding ##<< or ##section<< comments. Section 3 will explain in more detail how comments are processed.
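As an illustration of the optional comments described in this section, the following sketch documents a hypothetical function using a # title comment, ### comments after an argument, and ##section<< comments inside the body. The names is.pseudoprime.sketch and fermat.test.sketch, and all of the documentation wording, are assumptions made for this illustration; they are not taken from the fermat package sources.

```r
# Helper for the illustration: one randomized Fermat test (hypothetical name).
fermat.test.sketch <- function
### Test an integer with one randomized Fermat test.
(n ##<< The integer to test.
 ){
  a <- floor(runif(1, min = 1, max = n))
  a^n %% n == a
### TRUE if n passes the test for the randomly chosen base.
}

is.pseudoprime.sketch <- function # Check an integer for pseudo-primality
### Repeat the Fermat test several times.
(n, ##<< Integer to test.
 times
### Number of Fermat tests to perform. This longer description uses
### the multi-line style placed on the lines after the argument.
 ){
  ##seealso<< \code{\link{fermat.test.sketch}}
  ##details<< The test is probabilistic, so a composite number
  ## occasionally passes.
  for (i in seq_len(times)) {
    if (!fermat.test.sketch(n)) return(FALSE)
  }
  TRUE
### TRUE if n passed every test, FALSE otherwise.
}

# The comments are ordinary R comments, so the function runs unchanged.
is.pseudoprime.sketch(7, times = 5)
```

Because the documentation lives entirely in comments, the code can be sourced and tested as usual, with or without inlinedocs installed.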

Examples and named lists
The following code demonstrates inline documentation of named lists, and the preferred method of writing examples. On the final lines of the function definition, a ##value<< comment allows documentation of lists or data frames using the names defined in the code. The entries are documented using ##<< in the same way as function arguments, and this even works for nested lists. An alternative is to use attr(try.several.times,"ex") <- function(){code} later in the code. However, we prefer using structure since it keeps the examples near the object definition, and avoids repetition of the object name.
The simplicity of adding examples and generating a package using inlinedocs also allows for routine regression testing of functions with very little extra work. Even for small collections of functions, one can use R CMD check to run the examples and optionally check the output with reference output.
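The pattern described above can be sketched as follows: the function is wrapped in structure so that its examples live in the ex attribute, and the elements of the returned list are documented with ##<< comments inside a ##value<< block closed by ##end<<. The function summarize.sketch and its documentation text are hypothetical, and the exact comment placement is an assumption based on the description in this section.

```r
summarize.sketch <- structure(function # Summarize a numeric vector
### Compute a few summary statistics of a numeric vector.
(x ##<< Numeric vector to summarize.
 ){
  ##value<< A named list with the following elements:
  list(n = length(x),     ##<< Number of observations.
       mean = mean(x),    ##<< Arithmetic mean.
       range = range(x))  ##<< Minimum and maximum.
  ##end<<
}, ex = function(){
  # Examples are plain R code, so they can be run and debugged directly.
  s <- summarize.sketch(c(2, 4, 9))
  stopifnot(s$n == 3, s$mean == 5)
})

# The object is still an ordinary function ...
summarize.sketch(c(2, 4, 9))$mean
# ... and its examples can be executed by calling the ex attribute.
attr(summarize.sketch, "ex")()
```

Keeping the examples as a function body means that they are parsed together with the rest of the file, so a syntax error in an example is caught as soon as the file is sourced.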

Documenting classes and methods
S3 methods may be defined using plain R, or using setConstructorS3 and setMethodS3 from the R.oo package (Bengtsson 2003). The inlinedocs package detects S3 methods using utils::getKnownS3generics and utils::findGeneric, and updates the generated documentation automatically. S4 class declarations using the setClass function are also supported. The following example is from the source of inlinedocs:

    setClass("DocLink", # Link documentation among related functions
    ### The \code{DocLink} class provides the basis for hooking together
    ### documentation of related classes/functions/objects. The aim is that
    ### documentation sections missing from the child are inherited from
    ### the parent class.
             representation(name = "character", ##<< name of object
                            created = "character", ##<< how created
                            parent = "character", ##<< parent class or NA
                            code = "character", ##<< actual source lines
                            description = "character") ##<< preceding description block
             )

The inheritance referred to in this example is designed to avoid the need for repetitive documentation when defining a class hierarchy. The argument descriptions and other documentation sections default to those defined in the parent class. At present it only functions when all the definitions are within a single source file and this "documentation inheritance" is strictly linear within the file.

package.skeleton.dx for generating Rd files
The main function that the inlinedocs package provides is package.skeleton.dx, which generates Rd files for a package, and should be run before R CMD build. For example, package.skeleton.dx("fermat") processes R code found in fermat/R, and generates Rd files in fermat/man for each object in the package. The generated Rd files should be treated as object files, since any edits will be overwritten the next time the Rd files are generated.
Package authors with existing Rd files will have to convert them to inlinedocs comments manually. However, for new adopters of inlinedocs, it is possible to mix static Rd files and inlinedocs in the same package. For example, the following code specifies that file1.Rd and file2.Rd are static Rd files and so should not be generated by inlinedocs:

    my.parsers <- c(default.parsers, list(do.not.generate("file1", "file2")))
    package.skeleton.dx(parsers = my.parsers)

By design, inlinedocs is incapable of generating Rd files that document multiple objects, but package authors may write these Rd files manually using this mechanism.
More generally, the parsers argument to package.skeleton.dx should be a list of Parser Functions. Next, in Section 3, we explain how to write Parser Functions.

The inlinedocs system of extensible documentation generators
The previous section explains how to write inline documentation in R code using the standard inlinedocs syntax, then process it to generate Rd files using package.skeleton.dx. For most users of inlinedocs this should be sufficient for everyday use.
For users who wish to extend the syntax of inlinedocs, here we explain the internal organization of the inlinedocs package. The two central concepts are Parser Functions and Documentation Lists. Parser Functions are used to extract documentation from R code, which is then stored in a Documentation List before writing Rd files.

Documentation Lists store the structured content of Rd files
A Documentation List is a list of lists that describes all of the documentation to write to the Rd files. The elements of the outer list correspond to Rd files in the package, and the elements of the inner list correspond to tags in an Rd file. For example, consider the following code and its corresponding Documentation List. Parser Functions examine the lines of code on the left that define the functions, and return the Documentation List of tags shown on the right. This list describes the tags in the Rd files that will be written for these functions. The names of the outer list specify the Rd file, and the names of the inner list specify the Rd tag.
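To make the nesting concrete, here is a minimal hand-written sketch of a Documentation List for two functions. The outer names stand for Rd files and the inner names for Rd tags, with argument documentation stored under names of the form item{argument}. The second function, is.even.sketch, and all of the tag contents are assumptions chosen for illustration, not output produced by inlinedocs.

```r
# Outer list: one element per Rd file. Inner list: one element per Rd tag.
docs <- list(
  fermat.test = list(
    title = "fermat.test",
    description = "Test an integer for primality.",
    `item{n}` = "The integer to test.",
    value = "Whether the integer passes the Fermat test."
  ),
  is.even.sketch = list(           # hypothetical second function
    title = "is.even.sketch",
    `item{k}` = "The integer to check.",
    value = "TRUE if k is even."
  )
)

# Inspect the structure, then look up one tag of one Rd file.
str(docs)
docs[["fermat.test"]][["item{n}"]]
```

Since the structure is an ordinary named list, Parser Functions can build and modify it with standard list operations.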
To store parsed documentation, another intermediate representation that we considered instead of the Documentation List was the "Rd" object, as described by Murdoch and Urbanek (2009).It is a recursive structure of lists and character strings, which is similar to the Documentation List format of inlinedocs.However, we chose the Documentation List format since it allows rapid development of Parser Functions which are straightforward to read, write, and modify.

Structure of a Parser Function and forall/forfun
The job of a Parser Function is to return a Documentation List for a package. To do this, a Parser Function requires knowledge of what is defined in the package, so the arguments in Table 1 are supplied by inlinedocs. The R code files in the package are concatenated into code and then parsed into objs, and the DESCRIPTION metadata is available as desc. These arguments allow complete flexibility in the construction of Parser Functions that take apart the package and extract meaningful Documentation Lists. In addition, the docs argument allows for checking of what previous Parser Functions have already extracted.
In principle, one could write a single monolithic Parser Function that extracts all tags for all Rd files for the package, then returns the entire Documentation List. However, in practice, this results in one unwieldy Parser Function that does many things and is hard to maintain. A simpler strategy is to write several smaller Parser Functions, each of which produces an inner Documentation List for a specific Rd file, such as the following:

    title.from.firstline <- function(src, ...) {
      first <- src[1]
      if (grepl("#", first)) {
        list(title = gsub("[^#]*#\\s*(.*)", "\\1", first, perl = TRUE))
      } else list()
    }

This function takes src, a character vector of R code lines that define a function, and looks for a comment on the first line. If there is a comment, title.from.firstline returns the comment as the title in an inner Documentation List. This is a very simple and readable way to define a Parser Function.
But how does this Parser Function get access to the src argument, the source code of an individual function? We introduce the forall and forfun functions, which transform an object-specific Parser Function such as title.from.firstline to a Parser Function that can work on an entire package. These functions examine the objs and docs arguments, and call the object-specific Parser Function on each object in turn. The forfun function applies to every function in the package, whereas the forall function applies to every documentation object in the package. Thus, when using a Parser Function such as forfun(title.from.firstline), the additional arguments in Table 2 can be used in the definition of title.from.firstline, in addition to the arguments in Table 1 that are passed to every Parser Function. This design choice of inlinedocs allows the development of modular Parser Functions. For example, there is one Parser Function for ### comments, another for ##<< comments, another for adding the author tag using the Author line of the DESCRIPTION file, etc. Each of these Parser Functions is relatively small and thus easy to maintain.

Argument  Description
o         The R object.
name      The name of the object.
src       The source code lines that define the object.
doc       The inner Documentation List already constructed for this object.
Table 2: Arguments passed to each Parser Function, when used with forall or forfun.
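The lifting step performed by forfun can be sketched in a few lines. The function forfun.sketch below is a hypothetical, simplified stand-in written for this illustration and is not the inlinedocs implementation: it assumes that objs is a named list of objects and that each function carries a srcref attribute, which is obtained here by parsing with keep.source = TRUE.

```r
# Object-specific Parser Function from the text: use the comment on the
# first source line as the title.
title.from.firstline <- function(src, ...) {
  first <- src[1]
  if (grepl("#", first)) {
    list(title = gsub("[^#]*#\\s*(.*)", "\\1", first, perl = TRUE))
  } else list()
}

# Hypothetical, simplified stand-in for forfun (NOT the inlinedocs source).
# It turns an object-specific parser into one that visits every function.
forfun.sketch <- function(FUN) {
  function(objs, docs = list(), ...) {
    result <- list()
    for (name in names(objs)) {
      o <- objs[[name]]
      if (!is.function(o)) next               # skip non-function objects
      src <- as.character(attr(o, "srcref"))  # source lines of this function
      result[[name]] <- FUN(src = src, name = name, o = o,
                            doc = docs[[name]], ...)
    }
    result
  }
}

# A tiny stand-in for a package: one function and one non-function object.
code <- c("f <- function # A title for f",
          "(x){",
          "  x + 1",
          "}",
          "g <- 3")
env <- new.env()
eval(parse(text = code, keep.source = TRUE), env)

package.docs <- forfun.sketch(title.from.firstline)(as.list(env))
str(package.docs)
```

The object-specific function only ever sees one object at a time, which is what keeps each Parser Function small; the iteration over the package is written once in the wrapper.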

Extending the syntax with custom Parser Functions
The parsers argument to package.skeleton.dx specifies the list of Parser Functions used to create the Documentation List. The Parser Functions will be called in sequence, and their results will be combined to form the final Documentation List that will be used to write Rd files. Thus, the inlinedocs syntax can be extended by simply writing new Parser Functions.

docs[ tags != "" ] }
We can then define a list of custom Parser Functions as follows:

    simple.parsers <- list(forfun(title.from.firstline), forfun(simple))

These custom Parser Functions can be used to extract the following Documentation List from the definition above of simple:

    List of 1
     $ simple:List of 3
      ..$ title    : chr "a simple Parser Function"
      ..$ item{src}: chr "character vector of R source code."
      ..$ value    : chr "all the tags with a single pound sign."

In conclusion, a new syntax for inline documentation can be quickly specified using Parser Functions, and then inlinedocs takes care of the details of converting the Documentation List to Rd files.
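For readers who want to experiment, the following is one possible object-specific Parser Function that yields item{src} and value tags like those in the listing above. It is a sketch written for this illustration rather than the definition of simple used by the authors: the name simple.sketch, the regular expression, and the assumed syntax (a single pound sign immediately followed by the tag name, then the text) are all assumptions inferred from the output.

```r
# Hypothetical parser: every comment of the form "#tag text" becomes an
# entry named "tag" in the inner Documentation List.
simple.sketch <- function(src, ...) {
  pattern <- "^[^#]*#([^# ]+) +(.*)$"
  tagged <- grep(pattern, src, value = TRUE)   # lines with a "#tag text" comment
  docs <- as.list(sub(pattern, "\\2", tagged)) # the text after the tag
  names(docs) <- sub(pattern, "\\1", tagged)   # the tag itself
  docs
}

# Source lines of a function documented with the assumed syntax.
src <- c("simple <- function # a simple Parser Function",
         "(src, #item{src} character vector of R source code.",
         " ...){",
         "  #value all the tags with a single pound sign.",
         "}")

inner <- simple.sketch(src)
str(inner)
```

The title comment on the first line is deliberately ignored here, because the pound sign is followed by a space rather than a tag name; a separate Parser Function such as title.from.firstline is expected to handle it.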

Conclusions and future work
We have presented inlinedocs, which is both a new syntax for inline documentation of R packages, and an extensible system for parsing this syntax and generating Rd files. It has been in development since 2009 on R-Forge (Theußl and Zeileis 2009), has seen several releases on CRAN, and has been used to generate documentation for itself and several other R packages.
In practice, we have found that inlinedocs significantly reduces the amount of time it takes to create a package that passes R CMD check. In addition, inlinedocs facilitates rapid package updates since the documentation is written in comments right next to the relevant code.
For quality assurance, we currently have implemented unit tests for Documentation Lists, which assure that Parser Functions work as described. We also have unit tests which ensure that the generated Rd passes R CMD check without errors or warnings.
A potential criticism of inlinedocs is that excessive inline comments may obscure the meaning of code. Indeed, this is a design choice, and can be seen as a bug, but we prefer to see it as a feature: the documentation is always near the object definition, for quick reference.
Currently, the inlinedocs package relies on the srcref attribute of a function to access its definition, but that mechanism does not work for S4 methods. Instead, we use parse on the source files to access S4 class definitions. In the future, we would like to develop Parser Functions that use this approach to extract documentation for S4 methods and reference classes, which are currently unsupported in inlinedocs.
For the future, we would like to make use of Rd manipulation tools such as parse_Rd, as described by Murdoch (2009). For package authors who want to convert Rd files to inlinedocs comments, we may be able to use parse_Rd to develop a converter that takes R source code and Rd documentation, then outputs R code with documentation in comments.
Also, it would be advantageous to have functions for converting Documentation Lists to and from Rd objects. For example, after converting an inner Documentation List to an Rd object, we could use the print.Rd function to write the Rd file. This could be simpler than the current system of starting from the Rd files from package.skeleton and then doing find and replace. Furthermore, a converter from Rd objects to Documentation Lists would permit unit tests for the content of the Rd generated by inlinedocs.
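As a rough indication of what the first of these converters might involve, the sketch below turns an inner Documentation List into lines of Rd text. It is only an assumption about one possible shape of such a function: doc.to.rd.sketch is a hypothetical name, it produces plain character strings rather than a true Rd object, and it omits the escaping of special characters such as % that a real converter would need.

```r
# Hypothetical converter: inner Documentation List -> lines of Rd text.
doc.to.rd.sketch <- function(name, doc) {
  is.item <- grepl("^item\\{", names(doc))  # argument tags, e.g. "item{n}"
  plain <- doc[!is.item]
  lines <- c(sprintf("\\name{%s}", name),
             sprintf("\\%s{%s}", names(plain),
                     unlist(plain, use.names = FALSE)))
  if (any(is.item)) {
    items <- sprintf("  \\%s{%s}", names(doc)[is.item],
                     unlist(doc[is.item], use.names = FALSE))
    lines <- c(lines, "\\arguments{", items, "}")
  }
  lines
}

doc <- list(title = "Fermat test",
            `item{n}` = "The integer to test.",
            value = "A logical.")
rd <- doc.to.rd.sketch("fermat.test", doc)
writeLines(rd)
```

Argument tags are grouped into a single arguments section because their names already carry the item{...} form, so only the enclosing section needs to be added.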

A. The base function apply converted to inlinedocs
In this appendix, we show a concrete application by converting the source and documentation of the base function apply to inlinedocs.

A.1. Source and inline documentation
We use the following source code and comments to define the apply function and its documentation.
Characters that are special in Rd do not need to be escaped in inlinedocs, such as % in the documentation of the FUN argument.

    apply <- structure(function # Apply Functions Over Array Margins
    ### Returns a vector or array or list of values obtained by applying a
    ### function to margins of an array or matrix.
    (X, ##<< an array, including a matrix.
     MARGIN,
    ### a vector giving the subscripts which the function will be applied
    ### over. E.g., for a matrix \code{1} indicates rows, \code{2}
    ### indicates columns, \code{c(1, 2)} indicates rows and
    ### columns. Where \code{X} has named dimnames, it can be a character
    ### vector selecting dimension names.
     FUN,
    ### the function to be applied: see \sQuote{Details}. In the case of
    ### functions like \code{+}, \code{%*%}, etc., the function name
    ### must be backquoted or quoted.
     ... ##<< optional arguments to \code{FUN}.
     ){
      ##keyword<< iteration array
      ##details<< \code{FUN} is found by a call to \code{\link{match.fun}}
      ## and typically is either a function or a symbol (e.g. a backquoted
      ## name) or a character string specifying a function to be searched
      ## for from the environment of the call to \code{apply}.
      FUN <- match.fun(FUN)
      ##details<< If \code{X} is not an array but an object of a class
      ## with a non-null \code{\link{dim}} value (such as a data frame),
      ## \code{apply} attempts to coerce it to an array via
      ## \code{as.matrix} if it is two-dimensional (e.g., a data frame) or
      ## via \code{as.array}.
      dl <- length(dim(X))
      if(!dl) stop("dim(X) must have a positive length")
      if(is.object(X))
        X <- if(dl == 2L) as.matrix(X) else as.array(X)
      ### Body of function contains no inline docs and is omitted for brevity.

A.3. Generated Rd
The Rd produced by inlinedocs is shown below. In particular, note that the % characters have been correctly escaped.

    \name{apply}
    \alias{apply}
    \title{Apply Functions Over Array Margins}
    \description{Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.}
    \usage{apply(X, MARGIN, FUN, ...)}
    \arguments{
      \item{X}{an array, including a matrix.}
      \item{MARGIN}{a vector giving the subscripts which the function will be applied over. E.g., for a matrix \code{1} indicates rows, \code{2} indicates columns, \code{c(1, 2)} indicates rows and columns. Where \code{X} has named dimnames, it can be a character vector selecting dimension names.}
      \item{FUN}{the function to be applied: see \sQuote{Details}. In the case of functions like \code{+}, \code{\%*\%}, etc., the function name must be backquoted or quoted.}
      \item{\dots}{optional arguments to \code{FUN}.}
    }
    \details{\code{FUN} is found by a call to \code{\link{match.fun}} and typically is either a function or a symbol (e.g. a backquoted name) or a character string specifying a function to be searched for from the environment of the call to \code{apply}.

    If \code{X} is not an array but an object of a class with a non-null \code{\link{dim}} value (such as a data frame), \code{apply} attempts to coerce it to an array via \code{as.matrix} if it is two-dimensional (e.g., a data frame) or via \code{as.array}.}
    \value{If each call to \code{FUN} returns a vector of length \code{n}, then \code{apply} returns an array of dimension \code{c(n, dim(X)[MARGIN])} if \code{n > 1}. If \code{n} equals \code{1}, \code{apply} returns a vector if \code{MARGIN} has length 1 and an array of dimension \code{dim(X)[MARGIN]} otherwise. If \code{n} is \code{0}, the result has length 0 but not necessarily the \sQuote{correct} dimension.

    If the calls to \code{FUN} return vectors of different lengths, \code{apply} returns a list of length \code{prod(dim(X)[MARGIN])} with \code{dim} set to \code{MARGIN} if this has length greater than one.

    In all cases the result is coerced by \code{\link{as.vector}} to one of the basic vector types before the dimensions are set, so that (for example) factor results will be coerced to a character array.}
    \author{Toby Dylan Hocking}

MARGIN  a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names.
FUN  the function to be applied: see 'Details'. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted.

Details
FUN is found by a call to match.fun and typically is either a function or a symbol (e.g. a backquoted name) or a character string specifying a function to be searched for from the environment of the call to apply.
If X is not an array but an object of a class with a non-null dim value (such as a data frame), apply attempts to coerce it to an array via as.matrix if it is two-dimensional (e.g., a data frame) or via as.array.

Value
If each call to FUN returns a vector of length n, then apply returns an array of dimension c(n, dim(X)[MARGIN]) if n > 1. If n equals 1, apply returns a vector if MARGIN has length 1 and an array of dimension dim(X)[MARGIN] otherwise. If n is 0, the result has length 0 but not necessarily the 'correct' dimension.
If the calls to FUN return vectors of different lengths, apply returns a list of length prod(dim(X)[MARGIN]) with dim set to MARGIN if this has length greater than one.
In all cases the result is coerced by as.vector to one of the basic vector types before the dimensions are set, so that (for example) factor results will be coerced to a character array.

Author(s)

The ##end<< comment closes the return value documentation block. The examples are written using structure to put them in the ex attribute as the body of a function without arguments. This method for documenting examples was motivated by the desire to express examples in R code rather than in R comments, to keep the examples close to the object definition, and to avoid repetition of the object name. When examples are in R code, they are easily transferred to the R interpreter, and thus are easy to debug. Furthermore, when examples are written close to the object definition, it is easy to keep examples up to date and informative.

Argument  Description
code      Character vector of all lines of R code in the package.
env       Environment in which the lines of code are evaluated.
objs      List of all R objects defined in the package.
docs      Documentation List from previous Parser Functions.
desc      1-row matrix of DESCRIPTION metadata, as read by read.dcf.
Table 1: Arguments that are passed to every Parser Function.