Building Bivariate Tables: The compareGroups Package for R

The R package compareGroups provides functions meant to facilitate the construction of bivariate tables (descriptives of several variables for comparison between groups) and generates reports in several formats (LaTeX, HTML or plain text CSV). Moreover, bivariate tables can be viewed directly on the R console in a nice format. A graphical user interface (GUI) has been implemented so that users who are not familiar with the R software can build bivariate tables more easily. Some new functions and methods have been incorporated in the newest version of the compareGroups package (version 1.x) to deal with time-to-event variables, stratify tables, merge several tables, and revise the statistical methods used. The GUI has also been improved, making it much easier and more intuitive to set the inputs for building the bivariate tables. The first version (version 0.x) and this version were presented at the 2010 useR! conference (Sanz, Subirana, and Vila 2010) and the 2011 useR! conference (Sanz, Subirana, and Vila 2011), respectively. Package compareGroups is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=compareGroups.


Introduction
In many studies, especially epidemiological ones, it is important to compare characteristics between independent groups of individuals. Usually, these comparisons are presented in the form of tables of descriptive statistics where rows are characteristics and each column is a group. Tables of this form are usually called bivariate tables. For example, a bivariate table might compare treated and untreated patients (column-variable) in terms of age, history of hypertension, triglyceride levels, etc. (row-variables). Usually the number of row-variables is quite large, and thus construction of the bivariate table is laborious and time-consuming. And if, as often happens, the results must be presented stratified by sex, for example, the process is even more laborious and repetitive. For these reasons, we have implemented a package called compareGroups (Subirana, Vila, Sanz, Lucas, Peñafiel, and Giménez 2014) in the R software (R Core Team 2013) which quickly and efficiently generates bivariate tables in several different formats (plain text, HTML or LaTeX).
Depending on the nature of the variables, the appropriate statistics (means, standard deviations, medians and many others) are computed. It is also possible to compute odds ratios when assessing univariate association between several variables and a binary response such as case-control status, or to perform survival analysis (hazard ratios, log-rank p values, etc.) when dealing with a cohort study.
The compareGroups package does not implement new statistical methods. Instead, it relies on existing functions from different R packages, so that the user does not have to search for them manually, which saves a lot of time.
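To give an idea of the manual work this replaces, the following base R sketch computes, for a single continuous and a single categorical row-variable, the kind of descriptives and p values that a bivariate table collects. The data frame `d` and its values are hypothetical, and the tests chosen here (`t.test`, `fisher.test`, `glm`) are only illustrative; they are not necessarily the calls the package makes internally.

```r
# Toy data (hypothetical values, for illustration only)
d <- data.frame(
  group = factor(rep(c("treated", "untreated"), each = 6)),
  age   = c(61, 58, 70, 66, 59, 64, 55, 52, 60, 49, 57, 53),
  htn   = factor(c("yes", "yes", "no", "yes", "no", "yes",
                   "no", "no", "yes", "no", "no", "no"))
)

# Continuous row-variable: mean and standard deviation per group, plus a test
age_mean <- tapply(d$age, d$group, mean)
age_sd   <- tapply(d$age, d$group, sd)
age_p    <- t.test(age ~ group, data = d)$p.value

# Categorical row-variable: counts, column percentages and a test
htn_n   <- table(d$htn, d$group)
htn_pct <- prop.table(htn_n, margin = 2) * 100
htn_p   <- fisher.test(htn_n)$p.value

# Binary response: odds ratio from a univariate logistic regression
fit    <- glm(htn ~ age, family = binomial, data = d)
or_age <- exp(coef(fit)[["age"]])
```

Repeating these calls for every row-variable, and then formatting the results, is the repetitive part that a single call to the package is meant to take over.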
The compareGroups package contains classes, methods and generic functions (some of them well known to R users, such as print, plot or summary) meant to make the functions as simple and easy to use as possible. Nevertheless, there are many arguments that may be changed in order to modify the reported table: the number of decimals, whether to display absolute or relative frequencies, whether to display the number of available data, etc.
For those users who are not familiar with the R syntax, a graphical user interface (GUI) has been implemented. Using the GUI, it is possible to build bivariate tables without typing on an R console, but just clicking and dragging on a single frame. This GUI frame has been built using functions from the packages tcltk and tcltk2 (Grosjean 2013b,a). It contains a main menu from which the user can choose the data to load, select the desired format to export the bivariate table, etc. In the same GUI frame, the user can specify the variables to describe and many other options very intuitively.
In order to illustrate how compareGroups works, an example data set is provided in the package. This data set is taken from a real study (http://www.regicor.org/) with a subset of individuals and variables of different types: continuous, normally distributed, categorical, binary and time-to-event. In addition, an exhaustive user manual in the form of a vignette has been included in the package. It contains many R instructions showing how to specify compareGroups functions, arguments and methods to build the desired bivariate table, with detailed examples. Table 1 lists the R functions used to build the compareGroups package, showing the wide variety of standard R functions and options compiled to configure the package, so the user does not have to worry about finding the appropriate function on each occasion.

Constructing the bivariate table
All descriptives and p values must be tabulated in order to be read easily and clearly. There are some standard formats for these bivariate tables: for normally distributed variables, means and standard deviations inside round brackets are displayed; for non-normally distributed variables, median and quartiles between square brackets; for categorical variables one may choose to display both absolute and relative frequencies (inside round brackets) or only relative frequencies. Also, it may be useful to change the number of significant decimals. These and other table aspects can be modified very easily by changing the default values of compareGroups package functions.
Table 1: R functions used in the compareGroups package. Table legends are the following: (a) family = binomial; (b) method = "BH"; (c) method = "pearson"; (d) method = "spearman"; (e) method = "pearson" for which the p value is computed as $P\left(\chi^2_1 > r_{x,y}^2 \cdot (n - 1)\right)$, where x and y are the row-variable and the response converted to numeric, $r_{x,y}$ is the Pearson correlation (cor) between x and y and n is the number of available data.
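The p value described in legend (e) of Table 1 can be reproduced in a few lines of base R. The sketch below is a direct transcription of that formula; the function name `p_trend` is hypothetical and is not part of the package.

```r
# p value computed as P(chi^2_1 > r^2 * (n - 1)), where r is the Pearson
# correlation between x and y after conversion to numeric and n is the
# number of complete observations.
p_trend <- function(x, y) {
  ok <- complete.cases(x, y)          # keep only available data
  x  <- as.numeric(x[ok])             # factors become their level codes
  y  <- as.numeric(y[ok])
  n  <- length(x)
  r  <- cor(x, y, method = "pearson")
  pchisq(r^2 * (n - 1), df = 1, lower.tail = FALSE)
}

# Usage with hypothetical data: an ordered three-level response and a
# continuous row-variable
resp <- factor(rep(c("1995", "2000", "2005"), each = 4))
xvar <- c(5.1, 4.8, 5.6, 5.0, 5.4, 5.9, 5.2, 6.1, 6.0, 6.3, 5.8, 6.6)
p_trend(xvar, resp)
```

Because the response is converted to numeric, the result depends on the order of its levels, which is why this statistic is meaningful only for ordered groups.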
When several bivariate tables are needed for different subsets of participants (e.g., males and females), it is possible to display them side by side. Methods for the standard generic functions cbind (or rbind, when adding more variables to the table) have been implemented to do so. See the vignette available in the package for more details.
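As a minimal sketch of such a stratified table, the code below builds one table per sex from the regicor data set shipped with the package and places them side by side. It assumes that compareGroups accepts a `subset` argument, that `sex` has levels "Male" and "Female", and that the names given to `cbind` are used as stratum headings; the choice of row-variables is arbitrary.

```r
library(compareGroups)
data(regicor)

# Same row-variables and response in both strata (assumed requirement)
res_m <- compareGroups(year ~ age + bmi + smoker, data = regicor,
                       subset = sex == "Male")
res_f <- compareGroups(year ~ age + bmi + smoker, data = regicor,
                       subset = sex == "Female")

# One table per stratum, then combined column-wise
tab_strat <- cbind("Males" = createTable(res_m),
                   "Females" = createTable(res_f))
tab_strat
```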
Moreover, it may be very informative, for internal use, to display the number of available data (individuals) for each row-variable and group, or the selection criteria applied to each row-variable, etc. Once the bivariate table object has been created, this can be done simply by calling summary on it.
In the next section we list the different formats in which the bivariate table as well as the "informative table" derived from applying summary can be exported.

Reporting the bivariate table
The compareGroups package exports and displays the resulting bivariate tables in different formats: LaTeX, HTML and plain text CSV. In addition, the tables can be printed directly on the R console in a nice format.
A set of functions has been implemented to export the table to a .tex (LaTeX) document, HTML and .csv (plain text). Each function has arguments to set options such as the file name, the character used to separate columns when exporting to CSV format, whether to display the number of individuals in each group, etc.
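The export functions can be sketched as follows, writing the same table to the three formats. The file names and the temporary directory are arbitrary choices for illustration, and the argument names (`file`, `sep`, `caption`, `size`) are assumed to match the installed version of the package.

```r
library(compareGroups)
data(regicor)

# A small table to export (row-variables chosen arbitrarily)
res    <- compareGroups(year ~ age + sex + bmi, data = regicor)
restab <- createTable(res)

out_dir   <- tempdir()
csv_file  <- file.path(out_dir, "table_by_year.csv")
html_file <- file.path(out_dir, "table_by_year.html")
tex_file  <- file.path(out_dir, "table_by_year.tex")

# Plain text CSV, with ';' as the column separator
export2csv(restab, file = csv_file, sep = ";")

# HTML
export2html(restab, file = html_file)

# LaTeX, with a caption and a smaller font size
export2latex(restab, file = tex_file,
             caption = "Descriptives by year.", size = "small")
```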
In addition, when exporting to LaTeX it is possible to specify the caption, or to change the font size. Tables in LaTeX are exported under the longtable environment. Multicolumns and multirows are also used where needed to make the table more attractive. See Table 2 as an example of a sex-stratified bivariate table exported to LaTeX. Figure 1 shows an example of a bivariate table exported to HTML. Figures 2 and 3 show how a bivariate table and its corresponding "informative table" with available data, respectively, are printed on the R console.

Plotting
The compareGroups package is able to show graphically the distribution of the analyzed variables, both row-variables and response variable.
Using the generic plot function, a different plot is produced for each variable according to its nature. In addition, the user can select between two types of plots: those displaying only row-variables (univariate plots) and those showing the relationship between each row-variable and the response variable (bivariate plots).
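A minimal sketch of both plot types, using the regicor data set, is given below. It assumes that the plot method for compareGroups objects takes a logical `bivar` argument to switch between univariate and bivariate plots; when run non-interactively, the plots go to the default graphics device.

```r
library(compareGroups)
data(regicor)

# One continuous and one categorical row-variable, described by year
res <- compareGroups(year ~ age + sex, data = regicor)

# Univariate plots: one plot per row-variable, chosen according to its type
plot(res)

# Bivariate plots: each row-variable against the response variable
plot(res, bivar = TRUE)
```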
Although plotting is not the main goal of the compareGroups package, it may be very useful to check whether a continuous variable follows a normal distribution or not, to quickly visualize the number of data contained in each group of a categorical variable or to compare the incidence of a time-to-event variable through a Kaplan-Meier plot, etc. Other R packages, such as r2lh (Genolini, Desgraupes, and Franca 2011), are better suited to producing reports with graphs that visually describe the data contained in a data frame, and perhaps to more exhaustive data quality control. Figure 4 shows examples of these plots using variables from an example data set included in the package.

Classes and methods
The compareGroups package has been structured as any other standard R package with methods, functions and classes. They are organized sequentially as follows: In a first step, a function with the same name as the package, compareGroups, does all the calculations: descriptives, odds ratios, hazard ratios, p values, etc. In each step, objects of different classes are created. These objects can be printed, summarized, subsetted or even plotted (this last option is available only for objects created in the first step). Subsetting is done as usual, using [ brackets, and it may be very useful when the user wants to display only a subset of analyzed variables from an already built bivariate table. The generic update function is useful when a bivariate table serves as a "template" for a set of different tables in which only a few things, such as the response variable, change (see Section 3.1).
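Subsetting and updating can be sketched as follows with the regicor data set. The row-variables are chosen arbitrarily, and integer indices are used for subsetting because they are the form we are most confident is supported.

```r
library(compareGroups)
data(regicor)

# First-step object: all calculations, with year as the response
res <- compareGroups(year ~ age + sex + bmi + smoker, data = regicor)

# Keep only the first two analyzed variables of the existing object
res_small <- res[1:2]

# Reuse the same specification with sex as the response instead of year
# (sex is removed from the row-variables at the same time)
res_sex <- update(res, sex ~ . - sex)
res_sex
```

Because update re-evaluates the original call with the modified arguments, any other options given when the object was first created carry over unchanged.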
Figure 5 shows a scheme of the different functions, methods and classes of the objects created in the compareGroups package.
[Figure 5 panel labels: Step 1: Computations; Step 2: Construction; Step 3]

Using the package
In this section, an example illustrates how to use the compareGroups functions and tools. Specifically, we list the commands to compute descriptives of the variables from the regicor data set by year (1995, 2000 and 2005). For all continuous variables, mean and standard deviation will be computed except for triglycerides, physical activity, physical component and mental component, which will be treated as non-normal and consequently median, first and third quartiles will be calculated. In addition, only nontreated individuals will be selected in performing the analysis on cholesterol variables, i.e., total cholesterol, HDL-cholesterol, LDL-cholesterol and triglycerides.
In addition, we will show how easy it is to change the response variable; in this example, to perform descriptives by sex instead of year.
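A minimal sketch of the computation and table-construction steps, consistent with the description above, could look as follows. It is not the code of the original walk-through: the row-variables are restricted to a subset, and the variable names (`chol`, `hdl`, `ldl`, `triglyc`, `phyact`, `pcs`, `mcs`, `txchol`), the coding `method = 2` for non-normal variables and the level "No" of `txchol` are assumptions about the regicor data set and the package interface.

```r
library(compareGroups)
data(regicor)

# Computations: descriptives by year
res <- compareGroups(
  year ~ age + sex + chol + hdl + ldl + triglyc + phyact + pcs + mcs,
  data = regicor,
  # treat these four variables as non-normal (median and quartiles)
  method = c(triglyc = 2, phyact = 2, pcs = 2, mcs = 2),
  # cholesterol variables are analyzed in nontreated individuals only
  selec = list(chol = txchol == "No", hdl = txchol == "No",
               ldl = txchol == "No", triglyc = txchol == "No")
)

# Construction: build the bivariate table from the computed object
restab <- createTable(res)

# Print the table, then the "informative table" with available data
restab
summary(restab)
```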
R> export2latex(restab, loc = "bottom",
+    caption = "Descriptives by year.", size = "small")
Step 4: If we want to perform descriptives of the same variables by sex instead of year, we can take advantage of the generic R update function to avoid having to specify the table format again (see Table 6):
R> export2latex(update(restab, x = update(res, sex ~ . - sex)),
+    loc = "bottom", caption = "Descriptives by sex.")
Tables can be exported to plain text or to HTML with the export2csv and export2html functions, respectively.
Extensively detailed examples on how to construct, print and export tables can be found in the vignette contained in the package.

By GUI
A GUI based on packages tcltk and tcltk2 has been implemented. It is possible to read data from files of different formats (from SPSS, plain text, .RData). Also, the user can load a data frame already existing in the R workspace. This can be useful when the user has worked previously on a data set, recoding and preparing its variables.
To open the GUI, call the cGroupsGUI function with the data frame as its argument. The bivariate table is easy to customize through many options (number of digits, whether to report means or medians, absolute or relative frequencies, etc.), and the format in which the table is to be exported can be selected as well. The GUI has been designed in a single frame, making it much more comfortable to select and change all the options.