Scatterplot3d – an R package for Visualizing Multivariate Data

Scatterplot3d is an R package for the visualization of multivariate data in a three dimensional space. R is a “language for data analysis and graphics”. In this paper we discuss the features of the package. It is designed by exclusively making use of already existing functions of R and its graphics system and thus shows the extensibility of the R graphics system. Additionally some examples on generated and real world data are provided.


Introduction
Scatterplot3d is an R package for the visualization of multivariate data in a three dimensional space. R itself is a "language for data analysis and graphics" (Ihaka and Gentleman, 1996) and an open source and freely available statistical software package implementing that language, see http://www.R-project.org/. Historically, R is an implementation of the award-winning S language and system (Becker and Chambers, 1984;Becker et al., 1988;Chambers and Hastie, 1992;Chambers, 1998). Note that scatterplot3d could be ported to the commercial S implementation S-PLUS. Minor adaptions would be needed since S-PLUS lacks some of R's par() arguments (such as col.axis) and xyz.coords() whereas major coding would be needed to substitute for R's closures, see subsection 3.2.
Basically scatterplot3d generates a scatter plot in the 3D space using a parallel projection. Higher dimensions (fourth, fifth, etc.) of the data can be visualized to some extent using, e.g. different colors, symbol types or symbol sizes.
The following properties of scatterplot3d will be further described and discussed in the present paper: A plot is generated entirely by using interpreted R graphics functions, so the appearance of the plot is consistent with other R graphics. Such a behavior is important for publications. Most features of the R graphics system can be applied in scatterplot3d, among them are vectorizing of colors or plotting symbols and mathematical annotation (Murrell and Ihaka, 2000). The latter means whole formulas with e.g. greek letters and mathematical symbols inside can be added into plots using a L A T E X like syntax. Scatterplot3d can be easily extended e.g., by adding additional points or drawing regression lines or planes into an already generated plot (via function closures, see below). The package is platform independent and can easily be installed, because it only requires an installed version of R.
This paper is structured as follows: In Section 2 the design of scatterplot3d will be described, followed by remarks on the extensibility of the function in Section 3. Some examples (including code and results) on generated and real world data are provided in Section 4. We present other R related 3D "tools" in Section 5, followed by the conclusion in Section 6.
R and scatterplot3d are available from CRAN (Common R Archive Network), i.e., http: //CRAN.R-Project.org or one of its mirrors.

Design
Scatterplot3d is designed to plot three dimensional point clouds by exclusive usage of functions in the R base package. Advantages of this "R code only" design are the well known generality and extensibility of the R graphics system, the similar behavior of arguments and the similar look and feel with respect to common R graphics, as well as the quality of the graphics, which is extremely important for publications. Drawbacks are the lack of interactivity, and the missing 3D support (2D design).
While the function persp for plotting surfaces (cf. Section 5) applies a perspective projection, in scatterplot3d a parallel projection for a better comparison of distances between different points is used. Perspective projection is not (yet) available in scatterplot3d.
The final implementation of the function and the building of the package was done according to the "R Language definition" and "Writing R Extensions" manuals of the R Development Core Team (in short, 'R core'), 2003b and 2003c.

Arguments
The scatterplot3d function has been designed to accept as many common arguments to R graphics functions as possible, particularly those mentioned in the help pages of the function par and plot.default (R core, 2003a). In principle, arguments of par with a particular 2D design are replaced by new arguments in scatterplot3d. Regularly, values of the corresponding arguments in par for the first two dimensions are read out, and scatterplot3d either "guesses" the value for the third dimension or has an appropriate default.
A few graphical parameters can only be set as arguments in scatterplot3d but not in par. For details on which arguments have got a non common default with respect to other R graphics functions, see the "Usage" and "Arguments" sections of the help page (Section 2.2). One of these arguments is mar, which is used to specify the lines of margin to the four sides of a plot. It is also non-standard in another way: Users who want to extend an existing scatterplot3d graphic with another function than points3d, plane3d or box3d (cf. Section 3.2), should consider to set par(mar = c(b, l, t, r)) to the value of mar used in scatterplot3d, which defaults to c(5, 3, 4, 3) + 0.1, because R has to know about it for correct clipping of the figure region.
Other arguments of par may be split into several arguments in scatterplot3d, e.g. for specification of the line type. Finally, some of the arguments in par do not work, e.g. some of those for axis calculation. As common in R, additional arguments that are not mentioned on the help page can be passed through to underlying low level graphics functions by making use of the general '...' argument.

Help page -first part of help(scatterplot3d)
scatterplot3d

Description
Plots a three dimensional (3D) point cloud.

Arguments
x the coordinates of points in the plot.
y the y coordinates of points in the plot, optional if x is an appropriate structure.
z the z coordinates of points in the plot, optional if x is an appropriate structure.
color colors of points in the plot, optional if x is an appropriate structure. Will be ignored if highlight.3d = TRUE.
main an overall title for the plot.
sub sub-title. xlim, ylim, zlim the x, y and z limits (min, max) of the plot. xlab, ylab, zlab titles for the x, y and z axis.
scale.y scale of y axis related to x-and z axis.
angle angle between x and y axis (Attention: result depends on scaling. For 180 < angle < 360 the returned functions xyz.convert and points3d will not work properly. x.ticklabs, y.ticklabs, z.ticklabs vector of tick mark labels.
y.margin.add add additional space between tickmark labels and axis label of the y axis grid a logical value indicating whether a grid should be drawn on the plot.
box a logical value indicating whether a box should be drawn around the plot.
lab a numerical vector of the form c(x, y, len). The values of x and y give the (approximate) number of tickmarks on the x and y axes.
lab.z the same as lab, but for z axis.
type character indicating the type of plot: "p" for points, "l" for lines, "h" for vertical lines to x-y-plane, etc.
highlight.3d points will be drawn in different colors related to y coordinates (only if type = "p" or type = "h", else color will be used).
On some devices not all colors can be displayed. In this case try the postscript device or use highlight.3d = FALSE.
mar A numerical vector of the form c(bottom, left, top, right) which gives the lines of margin to be specified on the four sides of the plot. col.axis, col.grid, col.lab the color to be used for axis / grid / axis labels. cex.symbols, cex.axis, cex.lab the magnification to be used for point symbols, axis annotation, labels relative to the current. font.axis, font.lab the font to be used for axis annotation / labels. lty.axis, lty.grid the line type to be used for axis / grid.
... more graphical parameters can be given as arguments, pch = 16 or pch = 20 may be nice.
The remaining paragraphs, Value, Note, Author, See Also and Examples are left off here, since all their information is part of this text.

xyz.coords()
As well known from other R functions, vectors x, y and z (for the 3D case) are used to specify the locations of points.
If x has got an appropriate structure, it can be provided as a single argument. In this case, an attempt has to be made to interpret the argument in a way suitable for plotting. For that purpose, we added the function xyz.coords (R core, 2003a) into the R base package that accept various combinations of x and optionally y and z arguments. It is a "utility for obtaining consistent x, y and z coordinates and labels for three dimensional plots" (R core, 2003a). Many ideas used in this function are taken from the function xy.coords already existing for the 2D case. Even though xyz.coords was introduced to support scatterplot3d, it is designed to be used by any 3D plot functions making use of (x i , y i , z i ) triples 1 .
If the argument is a formula of type zvar~xvar + yvar (cf. R core (2003b) for details on formulas), xvar, yvar and zvar are used as x, y and z variables. If the argument is a list with components x, y and z, these are assumed to define plotting coordinates. If it is a matrix with three columns, the first is assumed to contain the x values, etc. Alternatively, two arguments x and y can be provided, one may be real, the other complex. In any other case, the arguments are coerced to vectors and the values plotted against their indices. If axis labels xlab, ylab and zlab are not given explicitly, xyz.coords attempts to extract them from the above mentioned data structures. Additionally, color vectors contained in a matrix, data frame or list can be detected by scatterplot3d internally.

Structure
The R code of scatterplot3d is structured into a few parts: A quite long list of arguments in the first part of the function is followed by some plausibility checks, extraction of characters, conversion of data structures (cf. Section 2.3), basic calculations of the angle for displaying the cube, and calculations regarding the data region limits, as well as data sorting for an optional "3D highlighting" feature.
In order to optimize the fit of the data into the plotting region, the second part of the function deals with optimal scaling of the three axes. This yields a high printout quality as well known from regular R graphics, but unfortunately it results also in a static plot, i.e., rotation is not possible. If scatterplot3d's with different viewing angles are put together as a "slide show" to imitate a rotation, each of these "slides" is individually optimally sized regarding the plotting region, so all in all such a "slide show" will not work.
After the graphics device is initialized in the third part, axis, tick marks, box, grid and labels are added to the plot, if it is required. In the last but one part, the data is plotted and overlayed by the front edges of the box.
Besides the main result, a drawn plot, four functions are generated and invisibly returned as Values in the last part of scatterplot3d. These functions, namely xyz.convert, points3d, plane3d and box3d, are required to provide extensibility of the three dimensional plot; details are described in Section 3.

Extensibility
Two kinds of extensibilities will be described in this section. On one hand, regarding the scatterplot3d design, the extensibility of the R graphics system will be discussed; it provides the tools and features enabling the programmer to write complex high level plot functions in a very general manner. On the other hand we describe the extensibility of scatterplot3d itself.

Extensibility of R graphics
R provides a huge collection of low level graphic functions like those for adding elements to an existing plot or for computations related to plotting. These functions are used to build very general high level functions, at least for the two dimensional case, and without them, the "R code only" design of scatterplot3d would have been impossible.
A selection of these low level functions begins with the functions to obtain x, y (and z) coordinates for plotting, namely xy.coords and xyz.coords (for the 3D case, cf. Section 2.3). Further on, the functions plot.new and plot.window can be used to set up the plotting region appropriately, pretty to calculate pretty axis tick marks, segments to draw line segments between pairs of points, and functions like title, axis, points, lines, text etc. are self-explanatory.
A huge collection of graphical parameters for R is documented in the help pages for par and plot.default (cf. R core, 2003a). Almost all low level graphic functions make use of the argument '...' which allows specifying most of these parameters in a very general manner. If this argument, '...', is also used in a high level function, arguments which are not explicitly introduced in the arguments list, can be passed through to lower level graphic functions as well; this is a powerful feature of the S language.
Since the R graphics system is designed for two dimensional graphics, it lacks some features for the three dimensional case. Unfortunately, the axis function works only for 2D graphics. Consequently a large amount of code was required to enable oblique axes for displaying the 3D scatter plot in an arbitrary angle.
Since locations in R graphics devices can only be addressed with 2D coordinates, the information on the projection has to be calculated by the 3D graphic functions internally. As described in Section 2, scatterplot3d uses a parallel projection. Since the R graphics device does not know anything about the projection, without any appropriate additional tools it is not possible to add elements into an existing scatterplot3d.

Extensibility of scatterplot3d
In Sections 1 and 2 it was emphasized that the scatterplot3d design was intended to be as general as possible. Some attempts to obtain this generality are described in Section 2 and its subsections. Because of the missing projection information, the ability of adding elements to an already existing scatterplot3d would be restricted, if only the already defined (and for the 2D case general) R functions could be used (cf. Section 3.1).
For this reason, scatterplot3d (invisibly) returns a list of function closures (cf. Section 2.4).
A function closure is a function together with an environment, and an environment is a collection of symbols and associated values (i.e. R variables). Thus these properties of R's scoping rules, called Lexical Scoping (Gentleman and Ihaka, 2000), are extensively used in scatterplot3d. Notice that Lexical Scoping is a feature of R, not defined as such in the S language.
In other words, the values returned by scatterplot3d are functions together with the environment in which they (and the scatter plot) were created. The benefit of returning function closures is, that the function "knows" the values of variables (in the environment) that were assigned to those variables at the time when the function was created. All in all, we made those functions know details about the axis scaling and the projection information that are required to add elements to an existing plot.
The following functions are returned by scatterplot3d: xyz.convert: A function which converts 3D coordinates to the 2D parallel projection of the existing scatterplot3d. It is useful to add arbitrary elements to the plot.
points3d: A function which draws points or lines into the existing plot.
Instead of an intercept, a vector of length three or a (g)lm object can be specified.
box3d: This function draws a box (or "refreshes" an existing one) around the plot.
xyz.convert is the most important function, because it does the parallel projection by converting the given 3D coordinates into the 2D coordinates needed for the R graphics devices. Examples how to use the mentioned function closures are given in Section 4.

Feature demonstration
In this section some of the features of scatterplot3d will be demonstrated using artificially generated data, well known examples from other R functions and the (slightly modified) examples of scatterplot3d's help file. The presentation starts with the latter, each example printed on an individual page to obtain lucidity. For more examples, including an example dealing with mathematical annotation, see Ligges and Mächler (2002).

Helix
In Figure 1 points of a helix are calculated and plotted using the 3D highlighting mode (highlight.3d = TRUE) in a blue box with a light blue grid. We produce the solid look with the point symbol, pch = 20.

RGB color cube
In Figure 2, we visualize the RGB (red-green-blue) color space which R and most computer screens use for color coding. First, we draw all the named colors available in R via colors(). Note that it might be interesting to find a better background color here than white. Optimally it would correspond to an RGB location as far away as possible from all given colors. Second, we show the rainbow() colors in the RGB space. Here we redraw the points on top of the cube, using the points3d() closure which is also the basis of our cubedraw() function.  1,1,0), c(0,1,0),c(0,1,1), c(0,1,0),0) cub <-min + (max-min)* cube01 res3d$points3d (cub[ 1:11,], cex = cex, type = 'b', lty = 1) res3d$points3d (cub[11:15,], cex = cex, type = 'b', lty = 3) } crgb <-t(col2rgb(cc <-colors())) rr <-scatterplot3d(crgb, color = cc, box = FALSE, angle = 24) cubedraw(rr) Rrb <-t(col2rgb(rbc <-rainbow(201))) rR <-scatterplot3d(Rrb, color = rbc, box = FALSE, angle = 24) cubedraw(rR) rR$points3d(Rrb, col = rbc, pch = 16) q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Figure 2: The RGB color cube. On the left, the named colors in R, i.e., colors(). Note the diagonal of gray tones. On the right, the locations and colors of rainbow(201).

3D barplot
With some simple modifications, it is possible to generate a 3D barplot, as shown in this example. To make the plot look like a barplot, type = "h" is set to draw vertical lines to the x-y plane, pch = " " to avoid plotting of point symbols and lwd = 5 to make the lines looking like bars. Furthermore, instead of three vectors a data frame is given as the first argument to scatterplot3d.

Adding elements
The importance of Lexical Scoping to generate function closures to provide extensibility of scatterplot3d was discussed in Section 3. We give an example for using the invisibly returned functions for the famous (at least for S users) tree data.
After the tree data is loaded, it is plotted by scatterplot3d, and the result is assigned to the variable s3d. The points (in blue) are plotted using type = "h", so one can see their x-y location very clearly.
In the next step, a linear model (assumption: volume depends on girth and height of the trees) is calculated. Furthermore, this lm object is plotted by the returned plain3d function (as part of s3d), and it results in a regression plane. Just for demonstration purposes, in the last step some red colored points (on a imaginary line crossing the plot) are plotted with an asterisk symbol.

Meta-analysis of controlled clinical trials
In the first real world example the data from a project on "Meta-Analysis in Biometry and Epidemiology" is taken. The data set contains the results of 13 placebo-controlled clinical trials which evaluated the efficacy of the Bacillus Calmette-Guérin (BCG) vaccine for the prevention of tuberculosis (TB). An important task in combining the results of clinical trials is to detect possible sources of heterogeneity which may influence the true treatment effect. In the present example, a possible influential covariate is the distance of each trial from the equator, which may serve as a surrogate for the presence of environmental mycobacteria that provide a certain level of natural immunity against TB. Other covariates may be the year the trial was carried out and the allocation scheme of the vaccination (A = alternate, R = random, S = systematic). For more details, especially on the choice of the trials and the meta-analytical methods of combining the results, we refer to Knapp and Hartung (2003) and the references given therein.
In Figure 5 the estimated risks of TB disease are plotted for the vaccinated group and the non-vaccinated group, respectively, in the dependence of the year the trial was carried out, of the absolute distance from the equator and of the allocation scheme. The color represents the precision of the estimated risks. Figure 5 clearly reveals a spatio-temporal trend in the realization of the trials. The former trials were carried out far away from the equator, and in all these trials one can observe an evident superiority of the BCG vaccine for the prevention of TB. Except one, all later trials were realized closer to the equator. In these trials, it is apparently that the estimated risks in the non-vaccinated groups are even rather low and, consequently, cannot graphically be separated from the estimated risks in the vaccinated groups. Finally, it is worthwhile to note that the later trial which was carried out far away from the equator has a relative small estimated risk in the non-vaccinated group compared to the former trials and, hence, does not yield such an evident superiority of the BCG vaccine.

Deep hole drilling
Our second real world example shows phase spaces (Tong, 1993) of the drilling torque of a deep hole drilling process. The data is taken from a project on "Analysis and Modelling of the Deephole-Drilling-Process with Methods of Statistics and Neuronal Networks". More detailed analysis on the data than provided in the following example was done by, e.g., Busse et al. (2001) and Weinert et al. (2001). Figure 6 visualizes the phase spaces of the drilling torques of two deep hole drilling processes, a regular and a chattering one. Obviously the points in the phase space of the chattering process are very systematically scattered, and the range of the data is very different for the two processes. The magnification of the regular process in Figure 7 shows that the points of the regular process are scattered unsystematically. Note that other lags like 10, 20, 100 would produce a similar plot. This indicates a sine wave like relationship in the chattering case.

Other 3D tools in R
At the time of writing scatterplot3d, the function persp() in the base package of R for three dimensional surface plots was available, but there was no way to generate 3D scatter plots in R itself. The data visualization system xgobi (Swayne et al., 1998) provides interactive visualization of multidimensional data, e.g. brush and spin, higher-dimensional rotation, grand tour, etc. The R package xgobi (Swayne et al., 1991; we have to distinguish the visualization system and the package) provides an Interface to xgobi and launches a xgobi process appropriately. ggobi (Swayne et al., 2003) is the next edition of xgobi and available at http://www.ggobi.org. Analogously to xgobi an R package Rggobi (Temple Lang and Swayne, 2001) exists in the Omegahat project (Temple Lang, 2000), http://www.omegahat.org, that allows one to embed ggobi within R and to both set and query the ggobi contents. All in all, ggobi can be loaded dynamically into R (as well as into other software products, in principle), and R into ggobi. This provides interactive, direct manipulation, linked, high-dimensional graphics within R.
The package rgl 2 by Adler and Nenadic (2003) is a portable R programing interface to OpenGL. Its features include, e.g., interactive viewpoint navigation, automatic data focus, up to 8 light sources, alpha-blending (transparency), and environmental effects like fogging. The package djmrgl 3 (Murdoch, 2001) also provides an R interface to OpenGL. A huge collection of useful functions to generate, manipulate and interactively rotate 3D objects is available. Efforts are under way to merge these two packages.
Very recently, the function cloud in the lattice package has been introduced. It is a 3D scatter plot function that works in the lattice (Sarkar, 2002) (and grid (Murrell, 2001)) environment of R, but it is still under development (as well as parts of lattice and grid) and therefore lacks some features (compare the help pages for details). Lattice is an implementation of Trellis Graphics, which is a framework for data visualization developed at the Bell Labs by Becker et al. (1996), extending ideas presented in Cleveland (1993).

Conclusion
In the design (Section 2) of the scatter plot function scatterplot3d emphasis is placed on generality and extensibility (Section 3). These two properties are demonstrated in Section 4 together with the high printout quality. A high printout quality and a homogeneous appearance with respect of any other R (2D) graphics is extremely important for publications and presentations. Thus we recommend to use scatterplot3d particularly for these purposes.