reporttools: R Functions to Generate LaTeX Tables of Descriptive Statistics

In statistical analysis reports, tables with descriptive statistics are routinely presented. We introduce the R package reporttools, which contains functions to efficiently generate such tables when compiling statistical analysis reports by combining LaTeX and R via Sweave.


Introduction
In statistical analysis reports and medical publications it is common practice to start statistical analyses with displays of descriptive statistics of patient characteristics and further important variables. These tables serve (1) to give an idea of the basic features of the data and (2), especially in analysis reports, to check the data. Tables of descriptive statistics may take different formats, depending on the type of variables to be displayed, which descriptive statistics are to be reported, and whether statistics should be given for all observations jointly or separately for the levels of a given factor, such as treatment arm.
To be able to efficiently generate these recurring parts of analyses when combining LaTeX (Knuth 1984; Lamport 1994) with R (R Development Core Team 2009) code via Sweave (Leisch 2002), the R package reporttools provides functions to generate LaTeX tables of descriptive statistics for nominal, date, and continuous variables. The tables are set up as data frames in R, then translated into LaTeX code using the standard R package xtable (Dahl 2009). Using Sweave, these tables can be generated directly in LaTeX documents by invoking essentially one line of R code.
[...] package, each variable gets a separate section providing descriptive statistics and corresponding plots, depending on the type of variable that is analyzed. The default functions in r2lUniv directly generate an entire LaTeX document, thereby reporting the descriptive statistics of only about three variables on one single page. Set up this way, it seems difficult to efficiently merge plain text and data analysis using r2lUniv. Additionally, we are not aware of an easy way in r2lUniv to compare a given variable between different groups. The functions introduced in reporttools aim at closing these gaps. The tabulating functions in reporttools can be applied inside a .Rnw document combining LaTeX plain text and data analyses in R.
A common requirement in data analysis is to provide descriptive statistics for a large set of variables, which generates tables that are longer than one page. By default, the functions described here set the option tabular.environment of the implicitly used R function print.xtable to longtable. Together with the LaTeX package longtable, this setting generates tables that may extend over more than one page, without additional specification or programming effort.
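The effect of this option can be sketched directly with xtable. The following is a minimal illustration, not the internal code of reporttools: the data frame, caption, and label are made up, and the LaTeX document is assumed to load the longtable package in its preamble.

```r
library("xtable")

# Hypothetical summary table, for illustration only
df <- data.frame(variable = paste0("var", 1:3), n = c(10L, 12L, 9L))
tab <- xtable(df, caption = "Illustrative table.", label = "tab:illustration")

# longtable replaces the default tabular environment; floating must be
# switched off because a longtable cannot sit inside a table float
latex <- print(tab, tabular.environment = "longtable", floating = FALSE,
               include.rownames = FALSE, print.results = FALSE)
cat(latex)
```

With print.results = FALSE the LaTeX code is returned as a character string instead of being printed, which makes it easy to inspect before it is written into a report.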
The package reporttools contains several additional functions useful in setting up analyses and writing reports. However, the emphasis in this article is on the tabulating functions. For a more detailed description of further elements of the package we refer to its R help files.
In Section 3 we introduce the dataset that is used to illustrate the tabulating functions in Section 4. Some conclusions are drawn in Section 5.

Stanford heart transplantation data
We illustrate our new functions using the Stanford heart transplantation dataset jasa in the standard R package survival (Therneau and Lumley 2009). This dataset provides some patient characteristics and the survival of patients on the waiting list for the Stanford heart transplantation program; see Crowley and Hu (1977) and the corresponding R help file for details.

Descriptive statistics for heart transplantation data
To demonstrate the full flexibility of reporttools, we use the package to describe the Stanford heart transplantation data.

Nominal variables (factors)
The following code lines invoke reporttools and prepare the variables from jasa for later use.
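A minimal sketch of such a preparation step is given here. It assumes the column names of jasa as shipped with survival; the selection of variables and the factor labels are illustrative assumptions and not necessarily the original code of the article. Only the names vars0, Transplantation, Age, and the three date column names are taken from the calls shown in this article.

```r
# library("reporttools")  # loaded first in an actual report
library("survival")       # provides the jasa data

jasa <- survival::jasa

# Data frame of variables to tabulate; check.names = FALSE keeps the
# descriptive column names (with spaces) for use as row labels
vars0 <- data.frame(
  "Birthday"                = jasa$birth.dt,
  "Acceptance into program" = jasa$accept.dt,
  "End of follow up"        = jasa$fu.date,
  "Follow up status"        = factor(jasa$fustat, levels = 0:1,
                                     labels = c("alive", "dead")),
  check.names = FALSE
)

# Grouping factor and age, kept as separate objects (labels are assumptions)
Transplantation <- factor(jasa$transplant, levels = 0:1,
                          labels = c("no", "yes"))
Age <- jasa$age
```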
Furthermore, by specifying either print.pval = "fisher" or print.pval = "chi2" we can ask for a p-value of either Fisher's exact test or a χ² test comparing the distributions between the groups defined by group. If interested in frequencies for patients not older than 50 years, we can set the option subset = (Age <= 50). Then, for the HLA A2 score, we consider missing values a category in the computation of percentages. To this end, we assign to miss.cat a vector containing the indices of the factor(s) in vars for which the percentages should be computed this way. All these options are implemented in Table 2.
It may happen that one prefers a different ordering or naming of the levels of a factor in a table. For example, suppose that in Table 1 frequencies of the patients who died should precede those of patients who are alive. We suggest reversing this ordering by recoding the corresponding factor variable, appropriately specifying the options levels and labels in R's command factor.
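Such a recoding can be sketched in base R as follows. The vector status and its values are hypothetical stand-ins for a follow-up status variable; only the use of levels and labels in factor is the point of the example.

```r
# Hypothetical status variable; levels default to alphabetical order
status <- factor(c("alive", "dead", "dead", "alive", "dead"))
levels(status)

# levels fixes the new order, labels renames the levels in that order
status_recoded <- factor(status, levels = c("dead", "alive"),
                         labels = c("Died", "Alive"))
table(status_recoded)
```

A table built from status_recoded then lists the patients who died first, using the new labels.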
Recoding factors is also recommended if one wants to change factor labels in a table. By default, vertical lines are added to the table. These can be omitted by specifying vertical = FALSE in the function call, as is done in the call for Table 1.
Finally, a vector that attaches a weight to each observation can be assigned to the option weights.

Date variables
The primary purpose of reporting descriptive statistics for date variables is data checking. The implemented function tableDate provides the number of observations (n), minimum (Min), first quartile (q1), median (x̃), mean (x̄), third quartile (q3), maximum (Max), and number of missing values (#NA). If not all of these statistics need to be reported, the desired columns can be specified via stats: simply provide a sub-vector of c("n", "min", "q1", "median", "mean", "q3", "max", "na") giving the statistics to be displayed, in the order they should appear in the table. As an illustration, we only display n, Min, Max, and #NA for the variables birth date, acceptance date, and end of follow up date of the heart trial in Table 3, via the following code:
R> vars3 <- vars0[, c("Birthday", "Acceptance into program",
+    "End of follow up")]
R> cap3 <- "Patient characteristics: date variables, by transplantation status."
R> tableDate(vars = vars3, group = Transplantation,
+    stats = c("n", "min", "max", "na"), print.pval = TRUE, cap = cap3,
+    lab = "tab: date1", longtable = FALSE)
As for tableNominal, the vars argument of the function is a data frame containing the variables of interest, but here each must be of R class Date. One can ask for a p-value of a Mann-Whitney test (or a Kruskal-Wallis test, depending on the number of levels of the grouping variable) to assess whether distributions differ between the groups defined by a given factor. This is especially relevant when analyzing data of a randomized trial, to verify whether patient characteristics are indeed equally distributed between randomized treatment arms.
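The choice between the two tests can be illustrated in base R. This is a sketch of the tests named above applied to simulated dates, not of how tableDate computes its p-values internally; all object names and the data are hypothetical.

```r
set.seed(1)

# Simulated dates without ties, for illustration only
dates <- as.Date("1970-01-01") + sample(0:1000, 40)

# Two groups: Mann-Whitney (Wilcoxon rank sum) test on the numeric dates
group2 <- factor(rep(c("no", "yes"), each = 20))
p_two <- wilcox.test(as.numeric(dates) ~ group2)$p.value

# More than two groups: Kruskal-Wallis test
group3 <- factor(rep(c("A", "B", "C"), length.out = 40))
p_three <- kruskal.test(as.numeric(dates) ~ group3)$p.value

c(mann_whitney = p_two, kruskal_wallis = p_three)
```

Both tests are rank based, so converting the dates to their numeric representation does not change the result.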

Continuous variables
Compared to tableDate, the function tableContinuous, which displays continuous variables, additionally provides the standard deviation (s) and interquartile range (IQR) as default statistics. When using tableContinuous, all variables in the data frame given as the vars argument to this function must be numeric. Again, via stats one can choose which statistics to display, out of the options c("n", "min", "q1", "median", "mean", "q3", "max", "s", "iqr", "na"). To provide even more flexibility in the choice of descriptive statistics, user-defined functions can be supplied to stats in tableContinuous. Such a user-defined function must take a vector as an argument and return a single number (the desired statistic). Missing values are removed by default. For illustration, we add a trimmed mean and the coefficient of variation c_v to the chosen default statistics in Table 5. The number of decimal places with which the descriptive statistics are displayed can be set using prec. Note that this option does not affect the columns n and #NA; their entries are always given as whole numbers. Finally, by specifying print.pval one can ask for a p-value of an F, t, Kruskal-Wallis, or Mann-Whitney test, as appropriate, to compare the variable under consideration between the groups given by group.
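User-defined statistics of this kind can be sketched as follows. The function names, the trimming fraction, and the example vector are hypothetical; the list format, which mixes default statistic names with named functions, is an assumption about how stats is passed to tableContinuous and should be checked against the package help file.

```r
# Each function takes a numeric vector and returns a single number
trimmed_mean <- function(x) mean(x, trim = 0.1)
coef_var <- function(x) sd(x) / mean(x)

# Assumed format: default names as strings, user functions as named entries
stats <- list("n", "min", "mean", "trimmed mean" = trimmed_mean,
              "max", "s", "cv" = coef_var, "na")

# Hypothetical data with one outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
mean(x)          # 14.5, pulled up by the outlier
trimmed_mean(x)  # 5.5, smallest and largest value dropped
coef_var(x)
```

The names of the list entries would serve as column headers, so LaTeX markup can be used there if a typeset symbol is preferred.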

Conclusions
In this article, we present some functions in the reporttools package for R that facilitate the presentation of descriptive statistics of nominal, date, and continuous variables when writing reports using Sweave. The package is available from CRAN.

Acknowledgments
I thank the editor and two referees for constructive comments that led to an improvement of reporttools and the presentation of the paper. I also thank Leo Held and Philipp Muri for helpful discussions.