Fast and Robust Bootstrap for Multivariate Inference : The R Package FRB

We present the FRB package for R, which implements the fast and robust bootstrap. This method constitutes an alternative to ordinary bootstrap or asymptotic inference procedures when using robust estimators such as S-, MMor GS-estimators. The package considers three multivariate settings: principal components analysis, Hotelling tests and multivariate regression. It provides both the robust point estimates and uncertainty measures based on the fast and robust bootstrap. In this paper we give some background on the method, discuss the implementation and provide various examples.


Introduction
In this paper we introduce the FRB package for R (R Core Team 2012) which implements robust multivariate data analysis with inference based on the fast and robust bootstrap (FRB) method of Salibian-Barrera and Zamar (2002).The package is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=FRB.Three multivariate settings are considered: 1. principal components analysis; 2. hypothesis testing for multivariate means; and 3. multivariate linear regression.The settings have in common that the classical analysis methods are easily robustified by the use of multivariate robust estimates of the type of S-, MM-or GS-estimates.These specific estimates allow the FRB to be applied in order to extend the robust point estimates with accompanying standard errors, confidence intervals or p values.We first give a short overview of the three concerned settings.

Hotelling T 2 tests
Consider again a sample X n = {x 1 , . . ., x n } ⊂ IR p from a p-variate distribution G with center µ and scatter matrix Σ.The one-sample Hotelling test is the standard tool for inference about the center µ.With X n and S n denoting the sample mean and sample covariance matrix, the hypothesis H 0 : µ = µ 0 is tested via the Hotelling T 2 statistic: Now, consider two samples X (1) n 1 and X (2) n 2 from p-variate distributions G 1 and G 2 with respective centers µ 1 and µ 2 and common scatter matrix Σ.Let X (j) n j denote the sample mean of the j-th sample and let S p n be the pooled sample covariance matrix.Then, the two-sample Hotelling statistic n 2 ) S p n −1 (X (1) n 2 ), can test the hypothesis H 0 : µ 1 = µ 2 .In both the one-and two-sample case, and under the assumption of underlying normality, the null distribution of the T 2 statistic is a multiple of an F -distribution.

Multivariate linear regression
Consider a sample of the form Z n = {(y 1 , x 1 ) , . . ., (y n , x n ) } ⊂ IR q+p .The multivariate linear regression model is given by where B is the p × q matrix of regression coefficients.It is assumed that the q-variate error vectors i are independent and identically distributed with zero center and scatter matrix Σ .The interest usually lies in the coefficient matrix B which classically is estimated through least squares.
It is well known that sample means, sample covariances and least squares estimates in linear regression are very sensitive to outliers in the data.Hence, statistical inference based on such estimates can be severely distorted, even by a few atypical observations.In the last few decades a considerable number of alternative multivariate estimators have been proposed, which were designed to be robust against outliers (see e.g., Maronna and Yohai 1998;Maronna, Martin, and Yohai 2006;Hubert, Rousseeuw, and Van Aelst 2008).A primary measure of robustness is the breakdown point of the estimator.It is roughly defined as the minimum fraction of observations in the data that would need to be replaced by arbitrary values in order to arbitrarily change the original estimate.Intuitively, it reveals the maximum fraction of outliers that the estimator can withstand.The classical estimators mentioned above have a breakdown point of 0% (asymptotically), whereas so-called high-breakdown robust estimators can go up to 50%.
Among the range of available robust estimators, we here focus on the class of S-estimators (Rousseeuw and Yohai 1984;Davies 1987) and the related classes of MM-estimators (Tatsuoka and Tyler 2000) and GS-estimators (Croux, Rousseeuw, and Hössjer 1994;Roelant, Van Aelst, and Croux 2009).These classes of estimators succeed in combining a high degree of robustness with relatively good efficiency properties.The second main reason to opt for these estimators is that they fall into a framework that allows the application of the FRB method.
Indeed, when aiming beyond point estimation, that is, when one is also interested in standard errors, confidence intervals and hypothesis tests, the use of robust estimators poses difficulties.While the least-squares normal-theory is well established, the available theory for robust estimators is limited to asymptotic results, often requiring quite stringent assumptions.Resampling methods such as bootstrap provide an interesting alternative, but are hampered by two specific problems.The first of these (and often the most serious) is the computational complexity of robust estimators.All affine equivariant high-breakdown estimators require time-consuming algorithms to find a sufficiently accurate approximation (exact computation is usually even not feasible).Hence, resampling is often not practical, especially for large data sets.The second problem is the instability of the bootstrap in case of outliers: due to possible outlier propagation in the resamples, there is no guarantee that inference based on the bootstrap is as robust as the estimate in the original sample itself.
The FRB method was first introduced in the context of univariate regression MM-estimators by Salibian- Barrera and Zamar (2002), and later generalized to multivariate settings by Van Aelst and Willems (2005) and Salibian-Barrera, Van Aelst, and Willems (2006).It is based on the fact that S-, MM-or GS-estimators can be represented by smooth fixed point equations which allows us to calculate only a fast approximation of the estimates in each bootstrap sample.Hence, the computation is much easier than for ordinary bootstrap.Furthermore, stability is ensured since observations that were downweighted in the original sample will automatically be downweighted in the bootstrap samples as well.
The primary aim of the FRB package is to provide a software implementation of the FRB method.The package then enables robust multivariate data analysis that includes uncertainty measures such as standard errors, confidence limits and p values.Note however that in the FRB method the bootstrap component is tightly linked to the chosen point estimator.Therefore the functions in the FRB package perform both the computation of the point estimates and the subsequent actual FRB procedure.The FRB package thus offers the FRB method instead of the ordinary bootstrap method, but it can only be applied to the point estimates that are made available in the package.In this sense its concept differs from e.g., the well-known boot package (Canty and Ripley 2013), which provides only the bootstrap component and can be applied to any estimator of choice (implemented in R).
S and MM-estimators for multivariate location and scatter are already implemented in the R package rrcov (Todorov and Filzmoser 2009).We take advantage of this fast implementation, based on C code, to obtain these point estimates in the FRB package.The rrcov package also offers PCA based on robust covariance estimates, but does not go beyond point estimation.Univariate regression MM-estimators are available in the R package robustbase (Rousseeuw et al. 2012) and have accompanying standard errors and p values based on its asymptotic distribution.The multivariate analogue of MM-regression is not publicly available in R and, to the best of our knowledge, neither are any other robust multivariate regression methods.We may therefore summarize the contributions of the package FRB as follows: -the package makes available the FRB method for robust inference.
-the package is the first to make available robust multivariate regression methods (specifically S-, MM-and GS-estimates).
While the point estimates for multivariate regression are part of the package, its name FRB reflects the essential component of the robust inference provided by the package.
The rest of the paper is organized as follows.Section 2 discusses multivariate S-, MM-and GS-estimators and describes how they can be used to obtain robust data analysis procedures.Section 3 explains the FRB method.Section 4 then covers the implementation of the estimates and the FRB method and gives an overview of the package FRB.Section 5 illustrates the use of the package through various examples, while Section 6 contains some concluding remarks.
(R2) ρ is strictly increasing on [0, c] and constant on [c, ∞) for some finite constant c.Rousseeuw and Yohai (1984) introduced S-estimators in univariate regression.One-sample multivariate S-estimators for location and scatter were later investigated by Davies (1987) and Lopuhaä (1989).Two-sample S-estimators, for robustly estimating two location vectors and a common covariance matrix, were considered by He and Fung (2000).A generalization to multivariate regression was considered in Van Aelst and Willems (2005).Note that the multivariate regression model encompasses the one-and two-sample location-covariance models (as well as the univariate regression model) as special cases.Nevertheless, we use separate definitions here for ease of understanding and because the applications are different for these special settings.The S-estimators can be defined as follows.
One-sample S-estimators.Consider a sample {x 1 , . . ., x n } ⊂ IR p .Then, for a chosen function ρ 0 , S-estimates of location and scatter ( μn , Σn ) minimize |C| subject to among all T ∈ IR p and C ∈ PDS(p).
Here, PDS(p) denotes the set of positive definite symmetric p × p matrices and by |C| we denote the determinant of the square matrix C. The constant b 0 is usually chosen such that b 0 = E Φ [ρ 0 ( x )], which ensures consistency at the normal model.In this paper and in the FRB package we use Tukey biweight ρ-functions, given by ρ(t) = min(t 2 /2 − t 4 /(2c 2 ) + t 6 /(6c 4 ), c 2 /6).The constant c can then be tuned to achieve any given degree of robustness, in terms of breakdown point (between 0% and 50%).
However, tuning ρ 0 involves a compromise since a higher degree of robustness corresponds to a lower Gaussian efficiency for the S-estimator.This trade-off can be avoided by computing a more efficient M-estimator as a follow-up step for the S-estimator, in which the robustly estimated S-scale is kept fixed.The resulting estimators are called multivariate MM-estimators, as introduced by Tatsuoka and Tyler (2000) for the one-sample location-covariance setting.
The definition is straightforwardly generalized to the two-sample or the regression context as explained below (see also Van Aelst and Willems 2011).
In the following, let σn := | Σn | 1/(2p) denote the S-scale corresponding to the S-estimates defined above.Then, based on a function ρ 1 which typically differs from ρ 0 by having a larger tuning constant c, multivariate MM-estimators are defined as follows.
One-sample MM-estimators.Given the S-scale σn , the MM-estimates for location and shape ( μn , Γn ) minimize among all T ∈ IR p and G ∈ PDS(p) for which |G|=1.
Two-sample MM-estimators.Given the S-scale σn , MM-estimates for the location vectors and common shape matrix ( μ1,n , μ2,n , Γn ) minimize Multivariate regression MM-estimators.Given the S-scale σn , the MM-estimates for the coefficients and error shape matrix ( Bn , Γn ) minimize among all B ∈ IR p×q and G ∈ PDS(q) with |G|=1.
The MM-estimate for the scatter matrix is then defined as Σn = σ2 n Γn .The function ρ 1 can be tuned to achieve any given efficiency for the MM-estimates without affecting the breakdown point of the estimates, which is the same as that of the initial S-estimates and thus determined only by ρ 0 .In practice, one usually chooses the constant c 0 in the (biweight) ρ 0 -function for the S-scale that yields the maximal breakdown value of 50%, while c 1 (> c 0 ) in ρ 1 is tuned to additionally achieve 95% Gaussian efficiency.There is a limited cost associated with the M-step, in the sense that MM-estimates of location or regression have a higher maximum bias than S-estimates, if indeed c 1 > c 0 (see Berrendero, Mendes, and Tyler 2007).Salibian-Barrera et al. (2006) give some efficiency comparisons between S-and MM-estimates.
Generalized S-estimators (GS) were introduced by Croux et al. (1994) in univariate regression and are generally more efficient than regular S-estimators.They minimize a robust scale of the differences between residuals rather than of the residuals themselves.By using differences of residuals, the estimation procedure is "intercept-free" and consequently cannot be used to estimate the intercept or location vectors in general.However, the intercept can easily be estimated e.g., by an additional M-step, in which the GS-regression slope coefficients and error scatter are kept fixed.(Roelant et al. 2009) introduced GS-estimators for multivariate regression which is the setting that we consider here as well.
Multivariate regression GS-estimators.Consider again a sample {(y 1 , x 1 ) , . . ., (y n , x n ) } ⊂ IR q+p and the linear regression model (1).Suppose x i = (1, u i ), i.e., an intercept term is included.Then, the GS-estimates for the slope coefficients and error covariance ( Computing the S-, MM-or GS-estimates requires computationally demanding approximative algorithms.Details about their implementation in the FRB package are given in Section 4.1 below.

Robust analysis
In order to obtain statistical procedures which are more resistant to outliers, it generally suffices to replace the classical estimates by robust estimates such as the ones described above, as briefly explained in this section.

Robust principal components analysis
Robust estimates of the population principal components are obtained by taking the eigenvectors of the robust scatter estimates Σn or Σn , instead of those of the sample covariance matrix.This is the plug-in approach to robust PCA, as applied in Salibian-Barrera et al.
(2006) (and see references therein for other authors).

Robust Hotelling T 2 test
Replacing the sample mean(s) and sample covariance in the Hotelling T 2 statistic by the robust one-or two-sample location and covariance estimates (which is again the plug-in approach), yields a robust test statistic.For example, the one-sample robust T 2 based on S-estimates is given by while the two-sample version is For the two-sample test, one can either use the two-sample estimates discussed above, or one can consider one-sample estimates in each group separately and pool the covariance estimates.
In this paper we focus on the first approach (although the FRB package provides both options).The difficulty in any case is to obtain an estimate of the null distribution since the classical F -distribution does not hold anymore.However, bootstrapping can be used as explained in Roelant, Van Aelst, and Willems (2008).

Robust multivariate regression
Robust regression analysis is immediately obtained by using S-, MM-, or GS-estimates instead of the least squares estimates.S-and MM-estimates are nowadays quite routinely used for robust univariate linear regression, see e.g., the R package robustbase (Rousseeuw et al. 2012).
The developments for the multivariate regression setting have only received attention in recent years, see e.g., Roelant et al. (2009) and references therein.
In each of these settings, both classical bootstrap and asymptotic inference have serious disadvantages, as explained above.Therefore, the option to perform FRB is most welcome.
In the next section this method is reviewed.

Fast and robust bootstrap
The fast and robust bootstrap method was introduced by Salibian- Barrera and Zamar (2002).
The idea is to draw bootstrap samples as usual, but instead of computing the actual (algorithm) estimator in each bootstrap sample, a fast approximation is computed based on the estimating equations of the estimator.For example, the following system of fixed-point equations holds for the multivariate regression S-estimator: where In general, let Θn ∈ IR m contain all estimates in vectorized form, for example in case of multivariate regression S-estimates Θn = (vec( Bn ) vec( Σn ) ) .Suppose further that Θn can be represented as a solution of fixed-point equations as where the function g n : IR m → IR m depends on the sample.Given a bootstrap sample, randomly drawn with replacement from the original sample, the recalculated estimates Θ * where the function g * n now depends on the bootstrap sample.As explained above, calculating the robust estimates Θ * n for every bootstrap sample can be a computationally expensive task.Moreover, even though we may assume that the solution to (9) was resistant to outliers, this does not guarantee that we will obtain an equally resistant solution to (10) as g * n is potentially more severely affected by outliers than g n is.Instead of Θ * n , however, we can easily calculate which can be viewed as a one-step approximation of Θ * n starting from the initial value Θn .It can be shown that, under certain conditions, the distribution of Θ * n consistently estimates the sampling distribution of Θn .It is intuitively clear, however, that the distribution of Θ1 * n does not have this property in general.Indeed, the recalculated Θ1 * n typically underestimate the actual variability of Θn , mainly because every bootstrap sample uses the same initial value in the one-step approximation.To remedy this, a linear correction can be applied as follows.Using the smoothness of g n , we can apply a Taylor expansion about Θn 's limiting value Θ, where R n is the remainder term and ∇g n (.) ∈ IR m×m is the matrix of partial derivatives.The remainder term is typically of order O p (n −1 ), and then equation ( 12) can be rewritten as Similarly, for the bootstrap estimates we obtain When g n (.) is essentially a smooth function of means, as in ( 7)-( 8) for example, it is straightforward to show that under certain regularity conditions where the left side should be considered as conditionally on the sample.Now, define the linearly corrected version of the one-step approximation (11) as If ( 14) indeed holds, then ΘR * n will be estimating the same limiting distribution as the actual bootstrap calculations Θ * n , and by ( 13) and (15) both will be consistently estimating the limiting distribution of Θn as desired.For one-sample S-or MM-estimates, a formal consistency proof under relatively mild conditions can be found in Salibian- Barrera et al. (2006), while Roelant et al. (2009) prove consistency for multivariate regression GS-estimates.
Clearly, ΘR * n is much faster to calculate than Θ * n .It is also more resistant against a possibly large number of outlier recurrences in the bootstrap sample, as can be seen by noting that the estimating equations typically involve weighted least squares estimates (or means) and covariances.The weights will be small or even zero for observations detected as outliers.The approximation does not recompute the weights but instead gives each observation its original weight, such that ΘR * n is essentially equally robust as Θn .The principle outlined above can be applied to multivariate S-, MM-and GS-estimators.Application possibilities for bootstrap methods in general are extremely wide, as almost any conceivable statistical estimate can be "bootstrapped" to assess its uncertainty, see e.g., Davison and Hinkley (1997).The method is obviously most useful when the concerned estimate has limited distributional theory, which is the case for robust estimates in general.For the robust estimates considered here, the FRB can do anything that classical bootstrap could do, only much faster and with less instability problems.In Section 4.3 below we list the specific applications that are available in the FRB package.

Point estimates
Computing S-and GS-estimates is far from trivial.Exact computation is practically infeasible because it involves optimizing non-convex objective functions with possibly many local minima, which on top require solving a non-linear equation to evaluate them.Since MMestimates need initial S-estimates, they too are obviously computationally demanding.In practice this type of estimates is usually approximated by algorithms which combine random subsampling with iterations of reweighted least squares (RWLS).Salibian- Barrera et al. (2006) in the context of univariate S-regression presented a very effective version of such an algorithm, called the fast-S algorithm.The algorithm involves the following steps: 1. Randomly select N elementary subsamples on which to compute the least squares solution, such that we have N different starting candidates for the S-estimates.
2. Locally improve each of the N candidates through k steps of RWLS.
3. Evaluate the objective function on each of the N candidates and retain the t best ones.
4. Further locally improve each of these t candidates until convergence and retain the best one.
Typical values for the parameters here would be N = 500, k = 2 and t = 5, although the optimality of these numbers depends on the data (see e.g., Salibian-Barrera, Willems, and Zamar 2008).In particular, N would ideally increase exponentially with the dimension of the data in order to preserve high robustness against outliers.
Generalization of the fast-S algorithm to multivariate settings is straightforward.The algorithm is also easily adapted to compute GS-estimates.Finally, once S-estimates are available, the second part in the MM-estimation requires only one iteration of RWLS.For multivariate location and scatter the rrcov package (Todorov and Filzmoser 2009) provides implementations of the S and MM-estimators using the fast-S algorithm.Starting from rrcov version 1.3-01 the output of these functions provides all information that is needed to perform the bootstrap part of the FRB method corresponding to the point estimates.However, for the two-sample and multivariate regression settings, implementations of these estimators are not yet available.In the FRB package, the functions Sest_twosample() and Sest_multireg() implement the fast-S algorithm in these two settings.They are also respectively called by the functions MMest_twosample() and MMest_multireg(), to compute the S-part of the corresponding MM-estimates.These functions additionally do the final iteratively RWLS part of these estimates.Finally, the function

GSest_multireg()
performs a version of fast-S which is adapted to GS-estimates.In these functions the parameters N , k and t default to the values given above, but can be changed by the user.The tuning of the ρ-functions in the estimates is by default set to obtain a 50% breakdown estimator in all cases.The second ρ-function in case of the MM-estimates is chosen to additionally yield 95% Gaussian efficiency.These settings can all be changed by the user to any sensible values.

Bootstrap distribution estimate
The implementation of the FRB procedure is quite straightforward: 1. Based on the original sample, compute the gradient ∇g n ( Θn ) and the corresponding correction matrix as given in (16).
2. Draw R bootstrap samples as usual.
3. For each bootstrap sample, compute the one-step approximation given by ( 11) and multiply by the correction matrix to obtain the bootstrap recalculations of the estimates.
Note that the correction matrix only needs to be computed once.In the package, the FRB in the various settings is performed by the functions Sboot_loccov(), Sboot_twosample() and Sboot_multireg() for the S-estimates, by the functions MMboot_loccov(), MMboot_twosample() and MMboot_multireg() for the MM-estimates, and by the function

GSboot_multireg()
for the GS-estimates.These ".boot_..." functions require the result from the respective ".est_..." functions as input parameter.Ideally the functions return R recalculated S-, MMor GS-estimates.However, a few problems may sometimes occur with the approximation ( 16), leading to less than R usable recalculations: • in the regression setting: the number of distinct observations with non-zero weight drawn into the bootstrap sample, needs to be sufficient to enable computation of the weighted least squares approximation (11); this is not guaranteed, especially in small samples, although failure is generally rare.
• in the PCA and Hotelling setting: due to the linear correction in ( 16) the FRB approximations to the scatter matrix estimates may lack positive definiteness, leading to awkward results when used to compute e.g., eigenvalues; the frequency of occurrence depends on the dimension and sample size.
If one of these problems occurs, the bootstrap sample is omitted from further calculations.An exception is made in the PCA setting in the rare event that more than 75% of the samples have failed to produce a positive definite scatter matrix.In that case, the make.positive.definitefunction from the corpcor package (Schäfer, Opgen-Rhein, Zuber, Ahdesmäki, Silva, and Strimmer 2012) is used in an attempt to rescue the bootstrap samples.If the attempt is succesful (which it often is but not guaranteed), the bootstrap sample is used and a warning is produced.

Bootstrap applications
Once we have the FRB-based estimate of the sampling distribution of the S-, MM-or GSestimates, we use it to derive several measures of uncertainty.Note that the FRB-estimate can be applied in exactly the same way as one would apply the ordinary bootstrap estimate (which again would be much more time-consuming and less robust).Here we give an overview of the bootstrap applications that are available in the FRB package for the respective settings.
Examples and illustrations are presented in Section 5 below.

Principal components analysis
Salibian- Barrera et al. (2006), while examining the performance of the FRB in robust PCA, considered four ways in which the bootstrap can be useful.We follow their approach in our implementation and provide the following uncertainty measures (in the following let Σn denote either S-or MM-estimates of scatter): • Standard errors and confidence limits for the eigenvalues of Σn , which estimate the eigenvalues of the population scatter matrix Σ.The eigenvalues can also be seen as estimates of the variance in the respective principal components.
• Standard errors and confidence limits for p k = ( k i=1 λ i )/( p i=1 λ i ), where λ i ; i = 1, . . ., p are the ordered eigenvalues of Σn .The statistic p k estimates the percentage of variance explained by the first k robust principal components (k = 1, . . ., p − 1) and is often used to decide how many principal components are retained for further analysis.Confidence limits for p k can give additional information on which to base such a decision.
• A distribution estimate of the angles between the eigenvectors of Σn and those of Σ.
The estimate is given by the empirical distribution of the angles between the bootstrap recalculated eigenvectors of ΣR * n and the eigenvectors of Σn .For example, an eigenvector of Σn which is relatively aligned with its recalculations based on ΣR * n , can be considered an accurate estimate of the corresponding eigenvector of Σ.
• Standard errors and confidence limits for the loadings of the robust principal components, which are the coefficients of the normalized eigenvectors of Σn and which estimate the loadings of the population level principal components.

Hotelling T 2 tests
In general, using the bootstrap to perform hypothesis tests requires some care because one needs to ensure that the resampling is done under conditions that agree with the null hypothesis.However, when a test statistic is pivotal, it suffices to draw ordinary bootstrap samples (that is, with replacement from the original sample).The pivotal assumption is usually reasonable for test statistics such as z-and t-type statistics and their multivariate T 2 variants, see e.g., Davison and Hinkley (1997, p. 139).Hence, following Roelant et al. (2008) we may assume that the distribution of the robust T 2 R in (5) does not depend on the true value of µ 0 .Therefore, the null distribution of T 2 R can be approximated by the distribution of where ( μR * n , ΣR * n ) are the FRB approximations for the location and covariance S-estimates in the bootstrap sample.Similarly, the null distribution of the robust two-sample test statistic (6) can be approximated by the distribution of with analogous notation.For both tests, the 5% critical value is then the 95% empirical quantile of the R recalculated T 2 * R statistics, where e.g., R = 1000.Rather than a single critical value, we consider two main bootstrap results for the Hotelling tests: • An estimated bootstrap p value, naturally obtained as the fraction of the recalculated T 2 * R statistics which exceed the value of the original T 2 R .
• Simultaneous confidence intervals for the components of the true location vector (or difference between the two location vectors), based on the empirical quantiles of T 2 * R , similarly to the classical T 2 -based intervals (Johnson and Wichern 1988, p. 239).

Multivariate linear regression
In the regression model ( 1), the bootstrap is mainly used to provide an uncertainty measure on the estimates of the coefficients in B. Specifically, we use the FRB to provide the following inference: • Standard errors and confidence limits for the coefficient estimates.
• Bootstrap p values corresponding to each coefficient, obtained by exploiting the duality between confidence intervals and hypothesis tests.The p value is defined as the probability p * such that 1 − p * is the smallest coverage level for which the confidence interval would include zero.
Van Aelst and Willems (2005) and Roelant et al. (2009) investigated the performance of FRBbased confidence intervals for B, respectively in case of S-and GS-estimates.Hypothesis tests that concern more than one parameter, such as for comparing nested models, require adapted resampling schemes.Salibian-Barrera ( 2005) investigated such tests in the univariate case.However, they have not yet been considered in the multivariate setting and thus are not available in the FRB package.

Confidence interval methods
In both the PCA and the regression setting we consider bootstrap confidence intervals.There are several well-known methods to compute such intervals from the bootstrap result.We implemented two of these: bias-corrected and accelerated (BCa) intervals, as considered by all FRB references mentioned above, and so-called basic bootstrap intervals (see e.g., Davison and Hinkley 1997, p. 204 and p. 29 respectively).
BCa intervals are defined as follows.Suppose θn is estimating the scalar parameter θ.Let θ * (α) n be the 100α-th percentile of the R bootstrap estimates θ * ,1 n , . . ., θ * ,R n for that parameter.Then, the BCa interval for θ with coverage 1 − 2α is given by where Here Φ is the standard normal cumulative distribution function, z α is the 100α-th percentile of the standard normal distribution, w is the bias-correction parameter and a the acceleration factor.Both w and a need to be estimated from the data.Estimation of w is straightforward, while for a we use empirical influences computed from the theoretical influence function assuming normality (see Davison and Hinkley 1997, p. 209).
The basic bootstrap interval on the other hand, with coverage 1 − 2α, is given by These intervals have been formally shown to be inferior to BCa intervals regarding accuracy.However, they may be more appealing because of their simpler definition and calculation.

R functions overview
The main functions in the FRB package are listed in Table 1.These functions can be called with a formula interface or by providing a dataframe(s) or matrix (matrices).They process the results from the FRB as described in Section 4.3 and produce objects which have print, plot and summary methods.
Figure 1 displays the functional structure of the package, with E standing for either S, MM or GS.Here, a solid arrow from foo1 to foo2 indicates that foo2 is called by foo1, while a dashed arrow would mean that foo1 requires the result of foo2.  the two-sample location-scatter functions are called only in the (two-sample) Hotelling test procedure, while the respective one-sample functions are used both for (one-sample) Hotelling tests and for PCA.
The function Scontrol() allows to change the default settings for the fast-S algorithm.It can directly be used in the input of the main functions such as FRBpcaS().For the GS-estimates the GScontrol() function acts analogously.For MM-estimates, the MMcontrol() function sets the fast-S parameters for the S-part and additionally allows e.g., to change the Gaussian efficiency, which defaults to 95%.

The main functions listed above return objects of respective classes
FRBpca, FRBhot and FRBmultireg.
For each of these classes the following methods exist: plot(), summary() and print().
These will be illustrated in the next section, but we first give an overview.The plot method acts as follows: plot.FRBpca(): The function produces graphs depicting the FRB inference results, essentially as listed in Section 4.3, or in particular: 1. FRB confidence intervals for the eigenvalues or the variances explained by the components.
2. FRB confidence intervals for the cumulative percentage of variance explained by the components.
3. Histograms of the angles between the FRB recalculated components and the original components.
4. FRB confidence intervals for the loadings of the principal components.

plot.FRBhot():
The function produces graphs depicting: 1.The histogram of the FRB recalculated test statistics.
2. FRB simultaneous confidence intervals for the components of the location or difference between locations.

plot.FRBmultireg():
The function depicts histograms for each of the FRB recalculated coefficient estimates with indication of the corresponding confidence intervals.
Moreover, the diagplot() method is available for outlier detection purposes.The plot is based on robust (Mahalanobis-type) distances of the observations.It is thus not related to the FRB, but solely uses the point estimates of the original sample.Therefore, for the multivariate regression methods provided by the FRB package, the diagplot function can also be applied directly on the point estimates.In particular, the following diagnostic plots are presented: diagplot.FRBpca(): Based on the location and covariance estimates, robust distances d i are computed for each observation.E.g., for S-estimates . These are plotted either versus the observation index, or versus a measure of the overall empirical influence that the observation would have on the classical principal components.The latter demands some additional computation time in order to obtain a simulation-based cutoff value for the empirical influences (see Pison and Van Aelst 2004, for details).
diagplot.FRBhot(): Based on the one-or two-sample location and covariance estimates, robust distances are computed for each observation and are plotted against their index (separately for each sample in the two-sample case).
diagplot.FRBmultireg(): Based on the estimates for the regression coefficients and the error covariance matrix, robust distances d i are computed for each residual.E.g., for S-estimates . These are again plotted either versus the observation index, or versus the robust distance of the observation in the covariate space.This last option is the typical diagnostic plot as introduced by Rousseeuw and Van Zomeren (1990) in univariate regression and by Rousseeuw, Van Aelst, Van Driessen, and Agulló (2004) in multivariate regression.The plot makes a distinction between three types of outliers: so-called bad leverage, good leverage or vertical outliers.It demands additional computation time since robust estimates for the location and scatter in the covariate space are required.For these additional computations the function uses the same type of estimate as that was used to produce the FRBmultireg object, with the same breakdown point and control settings.
The summary and print methods give formatted output that resembles well the output of the R functions for the corresponding classical methods.this should make the output easy to understand, even by non-experts in robustness, and facilitates comparison with the output of the classical method.For a large part the formatted output provides the numerical counterpart of the graphical representation produced by the plot method.However, it sometimes provides additional information.For example the function summary.FRBmultireg() lists p values for each regression coefficient, as explained in Section 4.3.

Examples
We now illustrate the use of the FRB package through some examples (a few more examples are available in the documentation of the package).Throughout this section we focus on MM-estimates.The use of S-or GS-estimates would obviously be similar.

Principal components analysis
Our first example concerns the Swiss Bank Notes data (Flury and Riedwyl 1988), which consists of p = 6 measurements on 100 real and 100 forged Swiss 1000 francs bills.We here consider only the forged bills.These data are available in the package through R> data("ForgedBankNotes") Suppose we are interested in the covariance structure of the data.To guard against the influence of possible outliers, robust PCA is advisable.PCA based on MM-estimates with FRB inference is obtained via R> res <-FRBpcaMM(ForgedBankNotes, R = 999, conf = 0.95) or alternatively, by using the formula interface R> res <-FRBpcaMM(~., data = ForgedBankNotes, R = 999, conf = 0.95) Note that we have specified that the number of bootstrap samples should be R = 999 and the confidence intervals should have nominal coverage of 95%.These are also the default settings for the FRB functions in the package.The summary method presents an overview of the results:

R> summary(res)
PCA based on multivariate MM-estimates (bdp = 0.5, eff = 0.95)  The output is intended to be self-explanatory.The confidence intervals shown are of the BCa type, which is the default in all applications.If interested, basic bootstrap intervals can be asked for instead by the command summary(res, confmethod = "basic").The intervals for the loadings are not listed but are available graphically through the plot method, together with the graphical FRB results for the angles and the percentages of explained variance:

R> plot(res)
The result of the plot method consists of various pages of output and the user is prompted before starting each new page.Figure 2 shows the first page, which in the top panel displays the (cumulative) percentage of variance explained by each component and in the bottom panel the variances in absolute terms, which are the eigenvalues.The FRB-based confidence intervals are indicated by the dashed lines.Again, basic bootstrap instead of BCa intervals can be requested by specifying plot(res, confmethod = "basic").
We see in this example that the first two principal components seem to explain more than 80% of the total variation.However, the lower confidence limit is actually below 80%.In general, when selecting the number of components to retain for further analysis on the basis of such percentages, it may be safer to consider the lower limits instead of the estimated percentages.
In Figure 3 we have the second output page of the plot method.It shows for each principal component the histogram of the angles between the original component and the corresponding bootstrap components.The angle between two components is expressed by a value between 0 and π/2.These limits are indicated by thick vertical lines on the histograms.We see for example that the estimate of the first component is very much aligned with its bootstrap versions (most angles are close to zero) indicating low variability of that estimate.For the other components quite some more instability is observed.
The last type of graph presented by the plot method displays the loadings for a given principal component along with FRB-based confidence intervals.By default the first 5 principal components are shown, on separate pages.Figure 4 shows the graph corresponding to the first component.Next to the histogram of the angles, this yields another way of assessing the stability of the component.Note that the loadings are the coefficients of the normalized eigenvectors and hence lie within the interval [−1, 1].We again conclude that the estimate of the first principal component should be quite accurate, since the confidence intervals are relatively small.The three types of graphs produced by the plot method can also be requested through separate functions, called plotFRBvars(), plotFRBangles() and plotFRBloadings() respectively.
The purpose of using robust estimates instead of the classical ones is often twofold: (1) ensuring that the analysis is reasonably well protected against the influence of possible outliers, and (2) detecting actual outliers in the data.For this last purpose, the package provides the diagplot method:

R> diagplot(res)
For objects of class FRBpca it draws by default the diagnostic plot of Pison and Van Aelst (2004), as explained in Section 4.4.The result is shown in Figure 5.The simpler plot of robust distances versus the index can be obtained through diagplot(res, EIF = FALSE).
The horizontal cutoff line, here and in all other diagnostic plots, is drawn at the square root of the 97.5% quantile of the χ 2 p distribution.We notice a total of 15 observations for which the robust distance clearly exceeds the cutoff and hence these should be regarded as outliers.The points also have a large empirical influence measure, which means that they would heavily influence the classical PCA analysis.Note that they did not have much influence on the robust PCA analysis because large robust distances by definition correspond to small weights in the robust estimates.

Hotelling test
Let us now consider the same data to illustrate the one-sample robust Hotelling test.We apply the test to demonstrate again the impact of the 15 outliers in this data set.That is, we formally test whether the robust test rejects the empirical (classical) mean of the data as the hypothesized true location.If so, we can conclude that the outliers severely influenced the empirical mean.We proceed as follows: R> samplemean <-apply(ForgedBankNotes, 2, mean) R> res <-FRBhotellingMM(ForgedBankNotes, mu0 = samplemean) or using the formula interface We see that T 2 R = 128.68 with an FRB-based p value of 0, implying that all of the bootstrap T 2 * R values are smaller than 128.68.Note that by default the number of bootstrap samples is R = 999.In order to learn more about the difference between the robust location estimate and the hypothesized location vector (in this case the non-robust location estimate), one can consider the simultaneous confidence intervals displayed in the output.From these intervals we notice that the difference mainly lies in the variables Bottom and Diagonal.Interpreting these intervals should be somewhat easier in the graphical representation obtained via

R> plot(res)
for which the result is shown in Figure 6.The top panel, first, shows the histogram of the T 2 * R values, which gives a somewhat better idea of the magnitude of the T 2 R value (if the value of T 2 R in the original sample would be below 100, it would be superimposed on the histogram).We may conclude that 128.68 is in fact huge.
The bottom panel then displays the confidence interval for each variable.In particular, each variable is scaled such that the interval has unit length and then centered around the hypothesized location vector mu0.In this way, all intervals can immediately be compared with regard to the significance of the difference between mu0 (horizontal line) and the robust location estimate (bullets).As expected, we see that Diagonal exhibits the largest difference, followed by the Bottom variable.The least significant differences are found for variables Right and Top.Note that these confidence intervals are generally conservative in the sense that the simultaneous confidence level holds for all linear combinations of the location components and here only p of these are considered.
Further inspection of the data would indeed reveal that the 15 outliers have particularly deviating values in the variables Bottom and Diagonal.For example, these 15 bank notes tend to have a shorter diagonal than the rest of the notes.Note that the 15 outliers would show up again in the diagnostic plot obtained through diagplot(res).
For an illustration of the two-sample Hotelling test, we turn to the Hemophilia data (Habemma, Hermans, and Van den Broek 1974), which are also available in the package:

R> data("hemophilia")
This data set contains two measurements on 75 women, belonging to two groups: n 1 = 30 of them are non-carriers and n 2 = 45 are known hemophilia A carriers.We would like to robustly test whether the difference between the two group means is significant.The MM-based test is obtained as follows: R> grp <-as.factor(hemophiliaFor each of the three responses, the output shows the values for the MM-estimates of the regression coefficients with the corresponding FRB standard error and p value for these coefficients.Significance is indicated by the usual codes.We conclude, for example, that the coefficient for occupation is significant for each of the three responses, and the same holds for counseling except for the response selfesteem. The plot method provides a graphical representation of these results.Here we request the FRB results for all explanatory variables except the intercept (by specifying expl = 2:6).
R> plot(res, expl = 2:6) If desired, the user can also specify which response variables to include, as would be useful in case q is large.Figure 9 shows the result of the above request.For each coefficient the bootstrap distribution is shown in a histogram and the confidence limits are superimposed.By default the BCa method is used for the intervals, but as before the confmethod argument can be used in both summary and plot to obtain the basic bootstrap intervals instead.The confidence level is as specified in the call to FRBmultiregMM().The coefficients which are significantly different from zero on this specific level are indicated in the graph by the red color and a star in the corresponding title.For example, the second row of plots stands out because these are the coefficients corresponding to the predictor occupation, which is significant for all three responses.
For outlier diagnostics we again apply the diagplot method.By default, this function first computes the MM-estimates of location and covariance in the space of the explanatory variables, based on which it computes the robust distances in the explanatory space.It then plots the residual distances versus these leverages (see also Section 4.4).The additional computations are time-consuming and can be avoided by setting the argument Xdist = FALSE, in which case the residual distances would simply be plotted against the index of the observations.Here we choose the default option: R> diagplot(res) Figure 10 contains the resulting diagnostic plot.It reveals at least one large outlier, observation 59, which can be categorized as a bad leverage point.Other observations which deserve some attention based on this plot are 12, 21, 33, 35 and 44.We conclude that a classical least squares analysis is likely to be overly influenced by a few outliers and especially by observation 59.Note that the outlier diagnostics can also be obtained without applying the FRB inference, but only based on the point estimates as follows: R> res <-MMest_multireg(cbind(reading, mathematics, selfesteem) ~., + data = schooldata) R> diagplot(res)

Conclusion
In this paper we provided some background on the fast and robust bootstrap method and we introduced the R package FRB for robust multivariate inference.Currently all functions in the package are written in plain R code.To speed up the computations, future work will therefore include replacing some of the code by an implementation in a lower-level language such as C, in particular the fast-S algorithm for multivariate regression.Furthermore, the recently introduced robust MANOVA tests with the FRB method (Van Aelst and Willems 2011) are intended to be added to the package in the near future.Other applications and developments of FRB will be added in updates of the package when they become available.

Figure 1 :
Figure 1: Structure of the FRB package.

Figure 2 :
Figure 2: Bank notes data.Percentage of variance explained with FRB-based confidence intervals.Result of plot method on object of type FRBpca.

Figure 3 :
Figure 3: Bank notes data.Histograms of angles between original and bootstrap estimates of principal components.Result of plot method on object of type FRBpca.

Figure 4 :Figure 5 :
Figure 4: Bank notes data.Loadings of the first principal component with FRB-based confidence intervals.Result of plot method on object of type FRBpca.

Figure 10 :
Figure 10: School data.Diagnostic plot.Result of diagplot method on object of type FRBmultireg.

Table 1 :
Main functions in package FRB.