The RcmdrPlugin.survival Package: Extending the R Commander Interface to Survival Analysis

The R Commander graphical user interface to R is extensible via plug-in packages, which integrate seamlessly with the R Commander’s menu structure, data, and model handling. The paper describes the RcmdrPlugin.survival package, which makes many of the facilities of the survival package for R available through the R Commander, including Cox and parametric survival models. We explain the structure, capabilities, and limitations of this plug-in package and illustrate its use.


Introduction
This paper describes the RcmdrPlugin.survival package, which augments the Rcmdr ("R Commander") package (Fox 2005, 2007) to provide a graphical user interface (GUI) to many of the facilities of the survival package for R (Therneau 2012; Therneau and Grambsch 2000). The initial impetus for developing a survival-analysis plug-in for the R Commander came from a desire to introduce Brazilian medical researchers gently to the powerful facilities of the survival package (as discussed in Carvalho, Andreozzi, Codeço, Barbosa, Serrano, and Shimakura 2005). We anticipate that the capabilities of the RcmdrPlugin.survival package will grow, and hope that it may provide at least some researchers with a bridge to writing their own R commands.
After presenting a brief overview of the R Commander in Section 2 of the paper, meant to orient readers who have not previously encountered the Rcmdr package, we describe the use of the survival-analysis plug-in in Section 3. This section of the paper serves the dual function of furnishing a basic manual for the RcmdrPlugin.survival package and of establishing a basis for a discussion of the design and implementation of the RcmdrPlugin.survival package in Section 4. In Section 5 of the paper, we reflect on the challenges of designing a graphical interface for survival analysis and on the consequent limitations of the RcmdrPlugin.survival package. The Rcmdr and RcmdrPlugin.survival packages are both available on the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/.

A brief overview of the R Commander and plug-ins
The R Commander (Rcmdr) package (Fox 2005) was originally conceived as a basic-statistics graphical user interface to R (R Development Core Team 2012a), suitable for supporting a first course in statistics taught with a text such as Moore (2010). The Rcmdr quickly expanded to include facilities for fitting, checking, and displaying linear and generalized linear models, along with some other methods, such as exploratory factor analysis.
The toolbar directly below the menu bar shows the name of the "active data set"; includes buttons for editing and displaying the active data set; and shows the name of the active statistical model. Data sets in the Rcmdr are R data frames. Most commands invoked via the Rcmdr menus are applied to the active data set, and while several data sets may be present in memory, only one is active at any given time. The user can change the active data set either by reading a new data set into memory via menu items under the Data menu (e.g., Data → Import data → from text file, clipboard, or URL...) or by selecting a data set already in memory by pressing the active data set button in the toolbar. When a statistical modeling function, such as lm or glm, creates a new statistical model object, that object becomes the "active model," to which commands generated via the Models menu apply. Statistical models are associated with particular data sets, and if more than one model is associated with the currently active data set, the user may choose among the models via the active model button in the toolbar.

Using the RcmdrPlugin.survival package
As is typical of Rcmdr plug-ins, the RcmdrPlugin.survival package can either be loaded directly, by the command library("RcmdrPlugin.survival"), or, with the Rcmdr running, via Tools → Load Rcmdr plug-in(s)... . The first part of the package name, RcmdrPlugin., is conventional for an Rcmdr plug-in package, and ensures that plug-ins sort directly after the Rcmdr on CRAN.
As illustrated in Figure 2, the RcmdrPlugin.survival package adds a variety of menus and menu items to the R Commander interface:
A Survival data sub-menu is added to the Data menu, containing items for defining characteristics of survival data (such as time variables and an event indicator); for converting data sets from "wide" to "long" form, when data records may include more than one period of observation for each individual; and for converting character data to date objects.
The Statistics → Fit models menu acquires menu items for Cox regression models and for parametric survival models.
A Survival analysis sub-menu is added to the Statistics menu, with items for estimating and comparing survival functions.
A test for proportional hazards in the Cox model is added to Models → Numerical diagnostics.
Several new items are added to the Models → Graphs menu, to graph survival functions based on a fitted Cox model; to plot terms in a Cox model; to plot dfbeta and dfbetas (influence diagnostics) for a survival regression model; to plot Martingale residuals for a survival regression; and to produce partial-residual plots for a Cox model.
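The two model-fitting menu items correspond to the coxph and survreg functions in the survival package. As a minimal sketch of the kind of commands involved, the following fits both models to the lung data set that ships with the survival package; this data set and the choice of a Weibull distribution are our assumptions for illustration and do not appear in the examples of this paper.

```r
library(survival)

# Stand-in data: `lung` is shipped with the survival package
# (an assumption for illustration; it is not analyzed in the paper).
cox_fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Parametric survival model; the Weibull distribution is an arbitrary choice.
weibull_fit <- survreg(Surv(time, status) ~ age + sex, data = lung,
                       dist = "weibull")

summary(cox_fit)
summary(weibull_fit)
```

Both calls return model objects, which is what allows the R Commander to treat the result as the active model.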
As we will illustrate presently, typical work-flow in analyzing survival data follows several steps:
1. We begin by reading the data via an appropriate selection in the Data menu.
2. We may transform the data via the Data → Active data set and Data → Manage variables in active data set menus, and if necessary convert character data to dates using Data → Survival data → Convert variable to date...
3. We typically define time and event variables for the survival data set, and perhaps also define strata or clusters, with Data → Survival data → Survival data definition....
4. We may change the format of the data set via Data → Survival data → Convert wide to long data....

Example: Brazilian hemodialysis data
As a more or less typical example of survival analysis, we employ, in simplified form, a data set analyzed by Carvalho, Henderson, Shimakura, and Sousa (2003), with data on all patients undergoing hemodialysis treatment in publicly funded clinics in Rio de Janeiro State, Brazil, between January 1998 (month 1 of the study) and August 2001 (month 44). The principal purpose of this and the following example is to illustrate how the RcmdrPlugin.survival package is used, not to undertake a serious analysis of the data, and we also assume general familiarity with methods of survival analysis (as described, e.g., in Therneau and Grambsch 2000). The Dialysis data set is included in the RcmdrPlugin.survival package, and comprises 6805 patients and seven variables:
center, a numeric code indicating in which of 67 medical centers the patient was treated. Although not strictly necessary, we find it convenient to convert this variable into a factor via Data → Manage variables in active data set → Convert numeric variables to factors..., selecting the option to use numbers as factor levels (dialog box not shown).
age, the age of the patient, in years at entry into the study.
begin, the month in which treatment began, coded 1 for patients already in treatment at the start of the study.
end, the month in which observation terminated, either because of death or censoring.
time, the time under observation, that is, end − begin.
event, a numeric event indicator, coded 1 if the patient died while under observation or 0 if censored.
disease, the disease causing kidney failure requiring dialysis, coded hypert (hypertension), congen (congenital), diabetes, renal, and other. As is the default in R, these five factor levels are initially ordered alphabetically, and are put in the order recorded here using Data → Manage variables in active data set → Reorder factor levels.... Although reordering the levels of disease isn't strictly necessary, hypertension is the most common diagnosis in the data set, and one with a moderate risk of death, so we find it convenient to select this as the baseline level for dummy-coded contrasts used in the statistical modeling reported below; we want other as the last level for esthetic reasons; and the remaining three levels are given in alphabetical order.
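The effect of reordering factor levels on the baseline category can be seen in a small sketch. The toy data frame below is invented for illustration (it is not the Dialysis data); it shows that the first level of a factor is the one omitted from the dummy-coded columns of the model matrix.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  center  = c(3, 1, 2, 1, 2),
  disease = c("renal", "hypert", "other", "congen", "diabetes"),
  stringsAsFactors = TRUE
)

levels(toy$disease)   # default: alphabetical order, so congen is the baseline

# Convert the numeric code to a factor and put hypert first, other last.
toy$center  <- as.factor(toy$center)
toy$disease <- factor(toy$disease,
                      levels = c("hypert", "congen", "diabetes", "renal", "other"))

# The baseline level (hypert) has no dummy column of its own.
dummy_columns <- colnames(model.matrix(~ disease, data = toy))
dummy_columns
```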
We read the data via Data → Data in packages → Read data from an attached package..., completing the resulting dialog, as shown in Figure 3. A brief overview of the data set is provided by selecting Statistics → Summaries → Active data set from the R Commander menus, producing the following results in the output window, which also displays the commands that were generated to read and reorganize the data:
> data(Dialysis, package="RcmdrPlugin.survival")
> Dialysis$center <- as.factor(Dialysis$center)
> Dialysis$disease <- factor(Dialysis$disease, levels=c('hypert',
+   'congen','diabetes','renal','other'))
> summary(Dialysis)
Before analyzing the Dialysis data, we must address the following issue: The natural time origin for patients in the study is the beginning of treatment, and for patients who entered the study between the 1st and 44th months, this is captured by the time variable in the data set. Some patients (so-called "prevalent cases") began treatment before the beginning of the study, however. If we knew how many months of treatment had occurred before the beginning of the study, we could accommodate these patients by employing the counting-process approach to survival analysis, but in the data set we are unable to distinguish between patients who entered treatment in month 1 of the study and those who were already being treated in month 1. Consequently, we employ the R Commander subset dialog to remove these cases, Data → Active data set → Subset active data set... (Figure 4), leaving us with 6665 of the original 6805 cases.
The RcmdrPlugin.survival package permits us to define persistent time and event variables, possibly along with strata and clusters, via Data → Survival data → Survival data definition..., producing the dialog box in Figure 5. The information is stored as attributes of the Dialysis data frame.
Where appropriate, these selections can also be made or modified in individual dialog boxes for estimating survival functions, fitting survival regression models, etcetera. We find it convenient in the current case to define time and event variables (literally the variables time and event).
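The two data-management steps just described, removing cases and recording the survival-data definition on the data frame, can be sketched in plain R. The toy data, the subset expression begin > 1, and the attribute names time_variable and event_variable are all our assumptions for illustration; the paper does not show the expression entered in the subset dialog or the attribute names that the package actually uses.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  begin = c(1, 1, 5, 20),
  end   = c(10, 44, 30, 44),
  event = c(1, 0, 1, 0)
)
toy$time <- toy$end - toy$begin

# Drop cases recorded as starting in month 1 (assumed subset expression).
toy <- subset(toy, begin > 1)

# Record the survival-data definition as attributes of the data frame.
# These attribute names are hypothetical.
attr(toy, "time_variable")  <- "time"
attr(toy, "event_variable") <- "event"

attributes(toy)[c("time_variable", "event_variable")]
```

Attributes set in this way travel with the data frame object, although many operations that create a new data frame do not carry custom attributes over, which is why the sketch sets them after subsetting.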
We proceed to estimate the survival function for the data set as a whole, using Statistics → Survival analysis → Estimate survival function..., leading to the dialog box in Figure 6, and producing the following output, along with a graph of the estimated survival function, shown in Figure 7:
> .Survfit <- survfit(Surv(time, event)~1, conf.type="log",
+   conf.int=0.95, type="kaplan-meier", error="greenwood",
+   data=Dialysis)
> summary(.Survfit)
Call: survfit(formula = Surv(time, event)~1, data = Dialysis, conf.type = "log", conf.int = 0.95, type = "kaplan-meier", error = "greenwood")
The widely spaced ellipses (. . .) in the output represent lines omitted for brevity. We knew that fewer than half the patients died during the study, and so requested estimates of quantiles above 0.5, including the 0.5 quantile to show that it is not estimable. The quantile method for survfit objects is not part of the survival package, but is supplied by RcmdrPlugin.survival.
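To make concrete why a quantile may not be estimable, here is a minimal sketch with invented toy data in which the Kaplan-Meier estimate never falls to 0.5. It assumes a recent version of the survival package, which supplies its own quantile method for survfit objects; that method, rather than the one in RcmdrPlugin.survival, is what the sketch calls, and its probs argument refers to the proportion of subjects who have experienced the event.

```r
library(survival)

# Toy data, invented for illustration: 2 events among 6 subjects.
toy <- data.frame(time = 1:6, event = c(1, 0, 0, 1, 0, 0))

km <- survfit(Surv(time, event) ~ 1, data = toy,
              conf.type = "log", conf.int = 0.95)
summary(km)

# Survival drops to 5/6 at time 1 and to 5/9 at time 4, and never reaches 0.5,
# so the 0.5 quantile (the median survival time) is reported as NA.
q <- quantile(km, probs = c(0.25, 0.5), conf.int = FALSE)
q
```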
We continue by comparing survival across levels of disease, via Statistics → Survival analysis → Compare survival functions..., leading to the dialog box in Figure 8, and producing the following output:
> survdiff(Surv(time,event)~disease, rho=0, data=Dialysis)
Call: survdiff(formula = Surv(time, event)~disease, data = Dialysis,
The result is displayed in Figure 9. Survival is highest for those with congenital disease and lowest for those with diabetes.
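The same kind of comparison can be reproduced outside the GUI. The sketch below applies survdiff with rho = 0 (the log-rank test) to the lung data set from the survival package, used here only as a stand-in for the Dialysis data, and computes the p value from the chi-square statistic.

```r
library(survival)

# Stand-in data: `lung` from the survival package (not used in the paper).
lr <- survdiff(Surv(time, status) ~ sex, data = lung, rho = 0)
lr

# p value from the chi-square statistic, with (number of groups - 1) df.
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
p_value
```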
Our next step is to fit a Cox model to the data, treating age and disease as covariates, via Statistics → Fit models → Cox regression model..., which produces the dialog in Figure 10, and the following output:
37.99 4 1.127e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Figure 11: Plots of Schoenfeld residuals for each covariate in the model. The dotted horizontal line is drawn in each panel at the estimated coefficient for the corresponding covariate; the dot-dashed line is a least-squares fit, corresponding to the test for proportional hazards; the solid line is a nonparametric-regression smooth, which can detect more general forms of nonproportional hazards; and the broken lines show a 95% pointwise confidence envelope around the smooth.
Thus, both the age and disease terms in the Cox model are highly statistically significant.
Having fit a Cox regression model to the Dialysis data, we would like to know whether the model adequately represents the data, and consequently we follow up with several diagnostics. Selecting Models → Numerical diagnostics → Test proportional hazards tests the proportional-hazards assumption of the Cox model and produces graphs of Schoenfeld residuals against time for each covariate (Figure 11).
There is, therefore, some evidence of non-proportional hazards, which we will not pursue here. Continuing with the diagnostics, partial-residual plots, produced by Models → Graphs → Cox model partial-residual plots and shown in Figure 12, suggest that the functional form of the model is reasonable; and index plots of dfbetas, produced by Models → Graphs → Plot survival-regression dfbetas and shown in Figure 13, suggest that none of the observations are unduly influential, as is to be expected in a data set this large.
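For readers who want to see the underlying commands, the following sketch runs the proportional-hazards test and extracts dfbetas with functions from the survival package. It uses the lung data set as a stand-in, so the numerical results bear no relation to those for the Dialysis data, and we do not claim that these are the exact commands generated by the menu items.

```r
library(survival)

# Stand-in data: `lung` from the survival package (not used in the paper).
cox_fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Test of proportional hazards, one row per term plus a global test.
ph_test <- cox.zph(cox_fit)
ph_test
# plot(ph_test) would graph scaled Schoenfeld residuals against time.

# Standardized influence of each observation on each coefficient.
dfb <- residuals(cox_fit, type = "dfbetas")
summary(abs(dfb))
```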
The functional form of the Cox regression can also be checked by plotting Martingale residuals from a null model against the regressors, obtained by Models → Graphs → Plot null Martingale residuals, and producing the graphs shown in Figure 14. The plots for the dummy variables are not informative, of course, but the plot for age reveals a monotone and apparently nonlinear relationship between the log-hazard and age, with the slope increasing with age. We could deal with this nonlinearity by transforming age or by using a low-degree-of-freedom regression spline in age.
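A null-model Martingale residual plot of this kind can be sketched as follows, again with the lung data set standing in for the Dialysis data. The lowess smooth is our choice of smoother for illustration and is not necessarily the one used by the plug-in.

```r
library(survival)

# Null Cox model: no covariates (stand-in data, not used in the paper).
null_fit <- coxph(Surv(time, status) ~ 1, data = lung)
mart <- residuals(null_fit, type = "martingale")

# Smooth the residuals against a candidate covariate to judge functional form.
smooth <- lowess(lung$age, mart)
# plot(lung$age, mart); lines(smooth) would draw the scatterplot and smooth.
```

A roughly straight smooth suggests that the covariate can enter the model linearly, whereas curvature points toward a transformation or a regression spline.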
Finally, we consider modeling differences among the centers as a random effect, by entering center as a frailty term in the model. To do this, we type the term frailty(center) directly into the model formula box in the Cox-model dialog, as shown in Figure 15, producing the following output:
Call: coxph(formula = Surv(time, event)~age + disease + frailty(center), data = Dialysis, method = "efron")
The frailty term is highly statistically significant, but the regression coefficients and their standard errors have not changed much.
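The comparison of coefficients with and without a frailty term can be sketched with the kidney data set from the survival package, in which the grouping variable is the patient identifier id. This data set is a stand-in of our choosing and has nothing to do with the dialysis centers analyzed above.

```r
library(survival)

# Stand-in data: `kidney` from the survival package (not used in the paper).
plain_fit   <- coxph(Surv(time, status) ~ age + sex, data = kidney)
frailty_fit <- coxph(Surv(time, status) ~ age + sex + frailty(id),
                     data = kidney)

# Side-by-side fixed-effect coefficients from the two fits.
comparison <- cbind(plain   = coef(plain_fit),
                    frailty = coef(frailty_fit)[names(coef(plain_fit))])
comparison
```

A model with a frailty term has class coxph.penal, which is why that class is among those that the plug-in registers as statistical models.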

Example: Recidivism data
To illustrate further the use of the RcmdrPlugin.survival package, we turn to data on criminal recidivism originally from Rossi, Berk, and Lenihan (1980) and analyzed extensively by Allison (1995) in a text on survival analysis using SAS. The data set, included in the RcmdrPlugin.survival package and read, as before, via the R Commander Data → Data in packages → Read data from an attached package... dialog, pertains to 432 convicts who were released from Maryland state prisons in the 1970s and followed up for one year after release. Half the released convicts were assigned at random to an experimental treatment in which they were given financial aid, and the other half did not receive aid.

Figure 15: Cox regression dialog box, with the term frailty(center) typed directly into the model formula.

The Rossi data set includes the following variables:
week, the week of first arrest after release or censoring; all censored observations are censored at 52 weeks.
arrest, an event indicator, coded 1 if the released convict was rearrested and 0 otherwise.
fin, a factor with levels yes if the convict received financial aid and no if he did not.
age, the convict's age in years at the time of release.
race, the convict's race, a factor with levels black and other.
wexp, the convict's full-time work experience prior to incarceration, a factor coded no or yes.
mar, the convict's marital status at the time of release, a factor coded married or not married.
paro, whether or not the convict was released on parole, a factor coded no or yes.
prio, the number of convictions prior to the current conviction.
emp1, employment status in the first week after release, a factor with levels no and yes.
The data, therefore, are in "wide" format, with employment status recorded in the last 52 variables of each data record. To analyze the data with employment status as a time-varying covariate, we have to reorient the data into "long" format, with one record in counting-process form for each week during which a convict remains at risk of rearrest. As well, we would like to lag employment status by one week; otherwise, the effect of employment status will likely be exaggerated, because a convict is unable to work at the end of the week in which he is rearrested.
We can accomplish these data-management tasks with Data → Survival data → Convert wide to long data..., producing the dialog box shown in Figure 16. In this dialog, we selected the time variable week and event indicator arrest; picked the time-dependent covariates emp1 through emp52 in the list box at the lower left; typed in the name emp for the resulting time-varying covariate; set the lag slider to 1; and are about to press the Select button to define the time-dependent covariate. Were there more than one time-varying covariate, we would repeat these operations. Finally, we press the OK button to create the new data set, which is named Rossi.long by default, and which becomes the active data set in the R Commander. Although it is moderately complex and can handle many wide-to-long conversions, the reshape dialog is also substantially limited in its capabilities; for example, it can only deal with covariates that are measured at each point in time at the same level of precision as the time-to-event variable. The unfold function that the dialog calls is somewhat more flexible. We return to this and other limitations of the RcmdrPlugin.survival GUI in Section 5 of the paper.
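The logic of the wide-to-long conversion with a one-week lag can be sketched in base R on invented toy data. This is not the unfold function that the dialog calls; the helper to_long and the toy data frame are hypothetical, and the sketch handles only the regular case described above, in which the covariate is recorded once per time unit.

```r
# Toy wide data, invented for illustration: one row per subject,
# with a weekly covariate emp1..emp4.
wide <- data.frame(
  id     = 1:2,
  week   = c(3, 4),          # week of event or censoring
  arrest = c(1, 0),          # 1 = event, 0 = censored
  emp1   = c("no", "yes"),
  emp2   = c("yes", "yes"),
  emp3   = c("no", "no"),
  emp4   = c(NA, "yes"),
  stringsAsFactors = FALSE
)

lag <- 1  # use the covariate value from the previous week

# Hypothetical helper: expand one subject into counting-process records.
to_long <- function(row, lag) {
  if (row$week <= lag) return(NULL)        # no period has a lagged value
  periods <- seq(lag + 1, row$week)        # at-risk weeks with a lagged value
  data.frame(
    id          = row$id,
    start       = periods - 1,
    stop        = periods,
    arrest.time = as.integer(periods == row$week & row$arrest == 1),
    emp         = unlist(row[paste0("emp", periods - lag)], use.names = FALSE)
  )
}

long <- do.call(rbind, lapply(seq_len(nrow(wide)),
                              function(i) to_long(wide[i, ], lag)))
long
```

With a lag of 1, the first week of each subject has no lagged covariate value and therefore contributes no record, so each subject yields one row fewer than the number of weeks at risk.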
While the original data set had 432 rows and 62 variables, the new data set has 19,377 rows and 14 variables, one row for each period of observation. The new data set has variables start and stop for the beginning and end of each period of observation, and a new event indicator arrest.time, coded 1 in the week in which a convict is rearrested and 0 otherwise. Time-constant covariates are simply copied into each record pertaining to a given convict. The first few rows of the data set, with data for the first two convicts, look like this:

Thus, convict 1 was rearrested in the 20th week and convict 2 in the 17th week following his release.
We proceed to fit a Cox model to the recidivism data (via the dialog in Figure 17), producing the following results:

The covariates age, employment status, and number of prior arrests have statistically significant coefficients; the coefficient of financial aid, the focus of the study, is just statistically significant by a one-sided test (halving the reported two-sided p value of .085).
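A Cox model for data in counting-process form uses the three-argument version of Surv, with the start and stop times of each period followed by the event indicator. As a hedged sketch, the following fits such a model to the heart data set from the survival package, which is already in (start, stop] form; it stands in for Rossi.long, which we do not reproduce here.

```r
library(survival)

# Stand-in data in counting-process form: `heart` from the survival package,
# with variables start, stop, event, age, and transplant (among others).
cp_fit <- coxph(Surv(start, stop, event) ~ age + transplant, data = heart)
summary(cp_fit)
```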
The RcmdrPlugin.survival package offers several options for graphing the results of a fitted Cox model. We show one approach here, using Models → Graphs → Cox-model survival function..., which leads to the dialog in Figure 18. We select Plot at specified values of predictors, set the slider to 2 rows, and fill in the values with the medians of age, education, and the number of prior arrests, and with the most common levels for the factors: not married for marital status, no for employment status, black for race, and yes for parole and work experience. We set financial aid to no and yes in turn. The resulting graph is displayed in Figure 19.
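Plotting estimated survival at specified predictor values amounts to supplying a data frame with one row per profile to the survfit method for Cox models. The sketch below does this for the lung data set (a stand-in), holding age at its median and varying sex; the variable names are those of the lung data, not of the recidivism data.

```r
library(survival)

# Stand-in data: `lung` from the survival package (not used in the paper).
cox_fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Two profiles: median age, with sex set to each of its two codes in turn.
profiles <- data.frame(age = median(lung$age), sex = c(1, 2))

curves <- survfit(cox_fit, newdata = profiles)
# plot(curves, lty = 1:2) would draw one estimated survival curve per profile.
dim(curves$surv)
```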

The design of the RcmdrPlugin.survival package
Rcmdr plug-in packages (Fox 2007) are ordinary R packages (see R Development Core Team 2012b) with some extra features. Plug-in packages consist of several components.
The DESCRIPTION file for the RcmdrPlugin.survival package includes the line Models: coxph, survreg, coxph.penal to instruct the R Commander that coxph, survreg, and coxph.penal objects represent statistical models, manipulable through the R Commander Models menu.
The file menus.txt, included in the inst/etc directory of the source package, and shown in slightly edited form in Figure 20, contains directives for adding to the R Commander menus.
There are up to seven fields, separated by white space, in each line of the menus.txt file:
The first field in each directive indicates what type of entry (menu or menu item) is defined by the directive; it is also possible to specify remove to delete an existing menu or item.
The second field supplies a name for a new menu or the name of the menu under which an item is to be installed; in either case this must be a valid R variable name.
The third field specifies the name of the parent menu for a newly defined submenu (such as statisticsMenu for survivalMenu), or one of two operations: command for a menu item (as in the second directive in Figure 20), or cascade to place a submenu under its parent.
The remaining fields apply to item directives. The fourth field specifies a label, as a character string enclosed in quotes, for the menu or menu item. By convention, a menu item that leads to a dialog box ends in ... .
The fifth field specifies the parent menu for a cascade operation, or a callback function for a menu item, that is, the name of an R function to be invoked when the menu item is selected by the user.
The sixth field specifies an optional R command to be executed to determine whether a menu item is activated or, alternatively, "grayed out." This expression, enclosed in quotes, should return the logical value TRUE or FALSE. The Rcmdr package provides a variety of "predicate" functions for testing commonly used conditions for menu-item activation; for example, the function activeDataSetP returns TRUE if there is an active data set and FALSE otherwise.
The seventh, and final, field is also optional: a quoted R command that evaluates to TRUE or FALSE, and determines whether a menu or menu item is installed; for example, the command packageAvailable('survival') is TRUE if the survival package is installed on the user's system and FALSE otherwise.
Missing fields in the menus.txt file are specified with the place-holder "", and may be omitted if they occur at the end of a line. At startup, the Rcmdr processes menu directives sequentially, and so, for example, a menu (such as survivalMenu) must be defined before menu items can be installed in it.
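To make the seven-field layout concrete, the sketch below writes three directives in the format just described and splits them into fields with read.table. The directives are illustrative and are not copied from the package's menus.txt file; the field names in col.names are ours; and we do not claim that the Rcmdr parses the file in exactly this way.

```r
# Illustrative directives (not copied from the package): define a submenu,
# install one item in it, then cascade the submenu under its parent.
directives <- c(
  'menu survivalMenu statisticsMenu "" "" "" "packageAvailable(\'survival\')"',
  'item survivalMenu command "Compare survival functions..." Survdiff "activeDataSetP()" "packageAvailable(\'survival\')"',
  'item statisticsMenu cascade "Survival analysis" survivalMenu "" "packageAvailable(\'survival\')"'
)

# Split each line on white space, honoring quoted fields.
fields <- read.table(
  text = directives, colClasses = "character",
  col.names = c("type", "menu", "operation", "label",
                "command", "activation", "install")
)
fields[, c("type", "menu", "operation", "label")]
```

Note that the menu directive comes first, in keeping with the sequential processing of directives at startup.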
The Survdiff function implements the dialog for comparing survival functions via the survdiff function in the survival package (see Figure 8, along with the associated output). The Survdiff function makes use of several Rcmdr-supplied utilities, including: initializeDialog to set up the dialog box; getSelection to return the selection from a list box; errorCondition to signal a user error; trim.blanks to remove leading and trailing blanks from a text string; doItAndPrint to execute the command generated from the dialog; OKCancelHelp to create the standard set of Rcmdr buttons at the bottom of the dialog; Factors to ascertain the names of factors (i.e., categorical variables) in the active data set; variableListBox to create a variable list box; getFrame to return the enclosing frame of a variable list box; labelRcmdr to create a standard Rcmdr label widget; dialogSuffix to perform a variety of housekeeping tasks to complete the dialog; and the getDialog and putDialog functions to permit the dialog to "remember" its state from one invocation to the next, a new facility introduced in version 1.7-0 of the Rcmdr. The defaults provided to getDialog are used if previous values haven't been stored, and dialog state is refreshed when the data set changes or when the Refresh button is pressed in a dialog.
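The central idea of such a callback, assembling an R command as a text string from the user's selections and then executing it, can be sketched without any GUI code. The helper build_survdiff_command below is hypothetical, and eval(parse(text = ...)) merely stands in for the Rcmdr's doItAndPrint, which additionally echoes the command and its output in the R Commander windows.

```r
library(survival)

# Hypothetical helper: paste the dialog selections into a command string.
build_survdiff_command <- function(time, event, strata, data, rho = 0) {
  paste0("survdiff(Surv(", time, ",", event, ")~", strata,
         ", rho=", rho, ", data=", data, ")")
}

# Selections that a user might make, here for the stand-in `lung` data.
command <- build_survdiff_command("time", "status", "sex", "lung")
cat(command, "\n")

# Stand-in for doItAndPrint: parse and evaluate the generated command.
result <- eval(parse(text = command))
result
```

Generating commands as text is what allows the GUI to display them to the user, who can then edit and re-execute them.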
Other commands used in the dialog are provided by the RcmdrPlugin.survival package: when there are two time variables selected, startStop determines which is the start and which the end of a period of observation; numericOrDate returns the names of numeric and date variables in the active data set.
Several commands are provided directly by the tcltk package: tclvalue, tkfocus, tkframe, tclVar, tkscale, and tkgrid. Finally, the gettext function, which is part of base R, makes provision for messages and other text in the RcmdrPlugin.survival package to be translated from English into other languages, a feature shared with the Rcmdr package. Currently, four translations are available for the RcmdrPlugin.survival package, into Brazilian Portuguese, French, Russian, and Spanish.
The RcmdrPlugin.survival package also provides some facilities that could have been included in the survival package but were not, such as a plot method for coxph objects and a quantile method for survfit objects.

Concluding remarks: Design challenges and limitations
It is difficult, and probably ill-advised, to expose all of the facilities of complex statistical software such as the survival package through a GUI: A complete GUI for the survival package would likely be hard to use. On the other hand, to the extent that the work-flow in a statistical analysis can be decomposed into semi-independent operations of manageable complexity, it should be possible to provide menus and dialog boxes to perform these operations. This is essentially the approach that is taken in the R Commander, and that approach is inherited by the survival-analysis plug-in described in this paper.
Although it might at first blush seem counter-intuitive, we believe that the object-orientation of R, and in particular of the statistical modeling functions in R, facilitates the development of GUIs, because the encapsulation of a statistical model, for example, in an R object allows us to modularize the process of data analysis, providing menu items and dialogs for standard manipulation of the model. This approach also fits well with the essentially iterative character of statistical modeling (as, of course, does the canonical command-driven approach in R). Contrast this with software in which all operations related to a statistical model must be specified in a single dialog or command, typically producing voluminous output. Similarly, the fact that R data frames are modifiable objects allows us to add information, subsequently available to the GUI, that is specifically relevant to survival analysis, such as the names of time and event variables in the data set.
To the extent that the work-flow of data analysis isn't easily broken into relatively simple steps, however, the design of a GUI becomes challenging. We believe that this is particularly true of data manipulation, where it is difficult in a GUI to make provision for all eventualities that commonly arise, and where many data sets have idiosyncratic features. It is, therefore, often much more natural to manipulate data by writing simple programs than through a GUI, and the R user who takes the time to master simple programming is consequently at a distinct advantage.
A case in point is the dialog provided by the RcmdrPlugin.survival package for reshaping data from wide to long form, discussed briefly in Section 3. As we explained, this dialog is capable of handling wide data sets with a regular structure in which time-varying covariates are measured at equally spaced intervals of the same precision as the time-to-event variable in the data set. The unfold function that the dialog invokes can handle more complex time-varying covariate structures, and other facilities in R, such as the standard reshape function and the reshape package (Wickham 2007), provide still greater flexibility. Writing a more general GUI for any of these facilities would be challenging and quite likely of dubious value. Indeed, given the requisite programming skill, it is often simpler to write a quick-and-dirty function for manipulating data than to use a general function. Notwithstanding these general caveats, however, we hope to improve the survival-data handling facilities of the RcmdrPlugin.survival package as the opportunity to do so arises.
The R Commander plug-in architecture has the advantage of embedding a GUI for a particular area of data analysis, such as survival analysis, in a more general framework, providing, for example, substantial infrastructure for reading, manipulating, and summarizing data. The R Commander also provides some general facilities to the plug-in developer for writing menus and dialogs. Of course, the resulting plug-in must conform to the general structure of the R Commander, which is oriented towards the step-by-step analysis of one rectangular data set at a time.
We must confess to having mixed feelings about GUIs for R, and for statistical software more generally. We believe that most serious users of R would be better served by learning to write commands. Nevertheless, GUIs such as the R Commander do have a legitimate role: for use in basic and perhaps intermediate-level statistics courses, where teaching complex computational tools such as R can distract from the principal focus on statistical ideas; among casual users of statistical software, whose memory of commands would otherwise be taxed; and in other special circumstances where ease of use is critical.