The ade4 package: Implementing the duality diagram for ecologists

Multivariate analyses are well known and widely used to identify and understand the structure of ecological communities. The ade4 package for the R statistical environment offers a large number of multivariate methods. Its implementation follows the tradition of the French school of "Analyse des Données" and is based on the use of the duality diagram. We present the theory of the duality diagram and discuss its implementation in ade4. Classes and main functions are presented. An example is given to illustrate the ade4 philosophy.


Introduction
Since the early work of Goodall (1954), who applied principal component analysis (PCA) to vegetation data, multivariate analyses have been, and remain, intensively used by community ecologists. Multivariate analysis provides methods to identify and summarize joint relationships of variables in large data sets. Community ecologists usually sample a number of sites and aim to analyze the effects of several environmental factors on several species simultaneously. In this context, the application of multivariate analysis to community ecology is natural, routine and fruitful (Gauch 1982, p. 1). The diversity of ecological questions, models and types of data has driven the development of a great number of multivariate methods. At least three R packages devoted to ecologists and available on the Comprehensive R Archive Network (http://CRAN.R-project.org) implement some of these methods (ade4, labdsv, vegan).
The ade4 package (Data Analysis functions to analyze Ecological and Environmental data in the framework of Euclidean Exploratory methods) is a complete rewrite for the R environment (R Development Core Team 2007) of the ADE-4 (in uppercase) software (Thioulouse, Chessel, Dolédec, and Olivier 1997).

The duality diagram

Consider a data table X with n rows (individuals, e.g., sites) and p columns (variables, e.g., species). Two questions naturally arise: what are the relationships between the variables, and what are the similarities or differences between the individuals? Multivariate methods aim to answer these two questions and seek small-dimension hyperspaces (few axes) where the representations of individuals and variables are as close as possible to the original ones. To answer the two previous questions, we need to define how to compute relationships between variables and differences between individuals. Thus, we define Q, a p × p positive symmetric matrix, and D, an n × n positive symmetric matrix. Q is a metric used as an inner product in R^p and thus makes it possible to measure distances between the n individuals. D is a metric used as an inner product in R^n and thus makes it possible to measure relationships between the p variables.
In practice, the choice of matrices X, Q and D is closely related to the objectives of the study. For an environmental table containing only quantitative variables, considering Euclidean distances between individuals leads to Q = I_p, where I_p is the p × p identity matrix. If we assume that relationships between variables are measured by covariances, then X = [x_ij − m(x_j)] (where m(x_j) is the mean of the j-th column of X) and D = (1/n) I_n. If we prefer to use correlations, then X = [(x_ij − m(x_j)) / sd(x_j)] (where sd(x_j) is the standard deviation of the j-th column of X) and D = (1/n) I_n. These two alternatives correspond to PCA on the covariance matrix (centered PCA) and PCA on the correlation matrix (normed PCA), respectively.
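These two triplets can be checked numerically with base R alone (no ade4 required); the simulated data and all object names below are purely illustrative:

```r
## Centered and normed PCA seen as the triplets (X, I_p, (1/n) I_n).
set.seed(1)
raw <- matrix(rnorm(20 * 4, mean = 5, sd = 2), nrow = 20)  # 20 sites, 4 variables
n <- nrow(raw); p <- ncol(raw)
D <- diag(rep(1/n, n))                                     # uniform row weights
Q <- diag(p)                                               # identity metric

## Centered PCA: X = [x_ij - m(x_j)]
Xc <- scale(raw, center = TRUE, scale = FALSE)
eig_cov <- sort(Re(eigen(t(Xc) %*% D %*% Xc %*% Q)$values), decreasing = TRUE)

## Normed PCA: X = [(x_ij - m(x_j)) / sd(x_j)], sd with a 1/n divisor
sdn <- apply(raw, 2, function(x) sqrt(mean((x - mean(x))^2)))
Xn  <- sweep(Xc, 2, sdn, "/")
eig_cor <- sort(Re(eigen(t(Xn) %*% D %*% Xn %*% Q)$values), decreasing = TRUE)

## The eigenvalues of X'DXQ match those of the covariance and correlation matrices
stopifnot(isTRUE(all.equal(eig_cov, eigen(cov(raw) * (n - 1)/n)$values)),
          isTRUE(all.equal(eig_cor, eigen(cor(raw))$values)))
```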
For a floro-faunistic table, several alternatives can be considered. For instance, PCA on species profiles (X = [x_ij / x_·j]) removes the effect of differences in global abundance between species when computing distances between sites. This analysis focuses on the relative distribution of each species over the sites and aims to compare the ecological preferences of the species.
Various centerings of table X can also be used. Mathematically and geometrically, and also ecologically, centering defines a point of reference for the study. Information is given by a site when it deviates from this hypothetical reference site, and a species is taken into account if it departs from the reference distribution over all sites. For noncentered data, the point of reference is the all-zero record: an empty site or a species that is always absent. Centering by species (X = [x_ij − x_·j]) implies that the reference point is a hypothetical site whose species composition is simply the mean species composition computed over all sites. Centering by site (X = [x_ij − x_i·]) implies that the reference point is an average species whose abundance in a site is a constant proportion of the total abundance in this site. For a more detailed description of these various transformations, the reader can consult Noy-Meir (1973); Noy-Meir, Walker, and Williams (1975); Legendre and Gallagher (2001); Dray, Chessel, and Thioulouse (2003).
Lastly, various definitions of matrices Q and D make it possible to give more or less weight to species and sites. For instance, setting Q = diag(x_·1, ..., x_·p) weights each species by its global abundance when computing distances between sites. This can be useful if we consider that the samples are not representative of the community. Sampling selectivity can be a reason for this nonrepresentativeness, because many species are rare in the sample not because they are rare in the studied area, but because the collecting method is not efficient at capturing them (Bayley and Peterson 2001). In this case, the information given by an abundant species is more reliable than that given by a rare species and must have more weight when comparing two sites. Different definitions of matrices X, Q and D correspond to different multivariate methods, including PCA (dudi.pca), correspondence analysis (dudi.coa), non-symmetric correspondence analysis (dudi.nsc), multiple correspondence analysis (dudi.acm), and so on. All these methods correspond to the analysis of a particular triplet (X, Q, D) of three matrices. This information is summarized in the following mnemonic diagram: It should be noticed that the triplet (X, Q, D) is strictly equivalent to the triplet (X', D, Q).
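The equivalence between a triplet and its transpose can be sketched in base R (the function triplet_eigenvalues and all object names are made up for the illustration):

```r
## Triplet notation: (X, Q, D) and the transposed triplet (X', D, Q) give
## the same nonzero eigenvalues.
triplet_eigenvalues <- function(X, Q, D) {
  op <- t(X) %*% D %*% X %*% Q          # the operator X'DXQ
  sort(Re(eigen(op)$values), decreasing = TRUE)
}
set.seed(2)
X <- scale(matrix(rnorm(30), nrow = 10), scale = FALSE)  # 10 sites, 3 species
Q <- diag(3); D <- diag(rep(1/10, 10))
ev1 <- triplet_eigenvalues(X, Q, D)        # 3 eigenvalues
ev2 <- triplet_eigenvalues(t(X), D, Q)     # 10 eigenvalues, only 3 nonzero
stopifnot(isTRUE(all.equal(ev1, ev2[1:3])))
```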

Eigendecompositions
As stated above, multivariate methods seek a small-dimension hyperspace where the representation of the individuals is as close as possible to the original one. This objective is achieved by the diagonalization of X'DXQ. Symmetrically, the small-dimension hyperspace where the representation of the variables is as close as possible to the original one is obtained by the diagonalization of XQX'D. Note that these two operators may be non-symmetric. These two diagonalizations are linked and will be considered together in the diagonalization of the duality diagram. In this section, we show the various relationships between these two diagonalizations and others that consider symmetric operators.
We consider the Cholesky decompositions Q = E'E (E is q × p) and D = B'B (B is g × n). In order to obtain a symmetric operator to diagonalize, we can 'break' the duality diagram:

Let Ω = BXE'. The q × q operator Ω'Ω = EX'B'BXE' is symmetric and its eigendecomposition (diagonalization 1 in the previous diagram) leads to Ω'ΩV = VΛ, where Λ is the diagonal matrix of eigenvalues and V contains the associated eigenvectors as columns.
Let F = E'V. After some matrix manipulations, we obtain QX'DXF = FΛ and F'Q^{-1}F = I_q. Setting A = Q^{-1}F, we obtain X'DXQA = AΛ and A'QA = I_q. Symmetrically, the operator ΩΩ' = BXE'EX'B' (g × g) is symmetric and its eigendecomposition (diagonalization 2 in the previous diagram) leads to ΩΩ'U = UΛ, where Λ is the diagonal matrix of eigenvalues and U contains the associated eigenvectors as columns.
Let G = B'U. After some matrix manipulations, we obtain DXQX'G = GΛ and G'D^{-1}G = I_g. Setting K = D^{-1}G, we obtain XQX'DK = KΛ and K'DK = I_g. In summary, we have the following general theoretical properties:
• Matrices Ω'Ω, ΩΩ', X'DXQ, DXQX', QX'DX and XQX'D have the same r nonzero eigenvalues, with r ≤ min(n, p, q, g).
• r is called the rank of the diagram, and the nonzero eigenvalues λ_1 > λ_2 > ... > λ_r > 0 are stored in the diagonal matrix Λ_[r].
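These properties can be verified numerically in base R; the Cholesky factors are obtained with chol(), and all names are illustrative:

```r
## The operators built from a triplet share the same nonzero eigenvalues.
set.seed(3)
X <- matrix(rnorm(8 * 3), nrow = 8)
Q <- diag(c(2, 1, 0.5))            # a diagonal column metric
D <- diag(rep(1/8, 8))             # uniform row weights
E <- chol(Q); B <- chol(D)         # Q = E'E, D = B'B
Omega <- B %*% X %*% t(E)
ev <- function(M) sort(Re(eigen(M)$values), decreasing = TRUE)
r <- qr(X)$rank                    # rank of the diagram
stopifnot(isTRUE(all.equal(ev(t(Omega) %*% Omega)[1:r], ev(Omega %*% t(Omega))[1:r])),
          isTRUE(all.equal(ev(t(Omega) %*% Omega)[1:r], ev(t(X) %*% D %*% X %*% Q)[1:r])),
          isTRUE(all.equal(ev(t(Omega) %*% Omega)[1:r], ev(X %*% Q %*% t(X) %*% D)[1:r])))
```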

Axes and components
From the previous part, we can see that the diagonalizations of X'DXQ (representation of the individuals) and XQX'D (representation of the variables) produce the same eigenvalues. There are also some relationships between the different subspaces that are defined by these diagonalizations. We define the following elements related to the eigendecompositions described above:
• F = [f_1, ..., f_r] is a p × r matrix containing the r nonzero eigenvectors (in columns) of QX'DX associated with the r eigenvalues of the diagram. F is Q^{-1}-orthonormalized, i.e., F'Q^{-1}F = I_r. These r columns define the principal factors.
• A = [a_1, ..., a_r] is a p × r matrix containing the r nonzero eigenvectors (in columns) of X'DXQ associated with the r eigenvalues of the diagram. A is Q-orthonormalized, i.e., A'QA = I_r. The r columns define the principal axes.
• K = [k_1, ..., k_r] is an n × r matrix containing the r nonzero eigenvectors (in columns) of XQX'D associated with the r eigenvalues of the diagram. K is D-orthonormalized, i.e., K'DK = I_r. The r columns define the principal components.
• G = [g_1, ..., g_r] is an n × r matrix containing the r nonzero eigenvectors (in columns) of DXQX' associated with the r eigenvalues of the diagram. G is D^{-1}-orthonormalized, i.e., G'D^{-1}G = I_r. The r columns define the principal cofactors.
The term duality is justified by the close connections between the four eigendecompositions, which make it possible to compute only one system of axes and to derive the three others from it. For instance, we have the following transition formulas: F = QA, G = DK, K = XQAΛ_[r]^{-1/2} and A = X'DKΛ_[r]^{-1/2}. We summarize this description in the following diagram:
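A base R sketch of these transition formulas (illustrative names; it assumes Q is full rank so that E is invertible):

```r
## Transition formulas, checked on a small triplet with diagonal metrics.
set.seed(4)
X <- matrix(rnorm(8 * 3), nrow = 8)
Q <- diag(c(2, 1, 0.5)); D <- diag(rep(1/8, 8))
E <- chol(Q); B <- chol(D)                 # Q = E'E, D = B'B
eg <- eigen(E %*% t(X) %*% t(B) %*% B %*% X %*% t(E), symmetric = TRUE)  # Omega'Omega
r <- sum(eg$values > 1e-10)
V <- eg$vectors[, 1:r, drop = FALSE]
A <- solve(E) %*% V                        # principal axes, A'QA = I_r
F <- Q %*% A                               # principal factors, F = QA
K <- X %*% Q %*% A %*% diag(1/sqrt(eg$values[1:r]))   # K = XQA Lambda^{-1/2}
G <- D %*% K                               # principal cofactors, G = DK
stopifnot(isTRUE(all.equal(t(A) %*% Q %*% A, diag(r))),
          isTRUE(all.equal(t(K) %*% D %*% K, diag(r))),
          isTRUE(all.equal(t(F) %*% solve(Q) %*% F, diag(r))),
          isTRUE(all.equal(t(G) %*% solve(D) %*% G, diag(r))))
```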

Properties
There are some fundamental properties linked to the diagonalization of a duality diagram.
• If we search for a Q-normalized vector a of R^p maximizing ||XQa||²_D, the solution is unique and is obtained for a = a_1. The maximum is equal to λ_1. If we search for another Q-normalized vector of R^p maximizing the same quantity under the orthogonality constraint a'Qa_1 = 0, the solution is again unique, obtained for a = a_2, with maximum λ_2, and so on until the last vector a_r.
In summary, the vectors a_1, a_2, ..., a_r successively maximize, under the Q-orthogonality constraint, the quadratic form ||XQa||²_D.

(Diagram: principal factors F with F'Q^{-1}F = I_r, principal axes A, principal components K and principal cofactors G, linked by the transition formulas.)
• The vectors g_1, g_2, ..., g_r successively maximize, under the D^{-1}-orthogonality constraint, the quadratic form ||X'g||²_Q.
• If we search for a pair of vectors a (a Q-normalized vector of R^p) and k (a D-normalized vector of R^n) which maximize the inner product ⟨XQa | k⟩_D = ⟨X'Dk | a⟩_Q, the solution is unique. It is obtained for a = a_1 and k = k_1, and the maximum is equal to √λ_1. Under the orthogonality constraints, the results extend to the other pairs. These general properties correspond to different statistical criteria in practice. In the case of PCA on a covariance matrix (i.e., X = [x_ij − m(x_j)], Q = I_p and D = (1/n) I_n), we obtain ||XQa||²_D = VAR(XQa) and ||X'Dk||²_Q = Σ_{j=1}^{p} COV²(k, x_j). Hence, the analysis simultaneously maximizes the variance of the projection of X onto the principal axes and the sum of the squared covariances between the principal component and the variables of X. Using the Pythagorean theorem, we can show that these two maximizations also induce the minimization of the distances between the original configurations and those obtained in the subspaces defined by the principal axes and components (Figure 2).
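The first maximization property can be illustrated in base R by comparing the criterion reached by the first principal axis with that of random Q-normalized vectors (here Q = I_p, so Q-normalized simply means unit length; all names are illustrative):

```r
## The first principal axis maximizes ||XQa||^2_D among Q-normalized vectors.
set.seed(5)
X <- scale(matrix(rnorm(12 * 3), nrow = 12), scale = FALSE)
n <- 12
D <- diag(rep(1/n, n))
crit <- function(a) { s <- X %*% a; drop(t(s) %*% D %*% s) }  # ||XQa||^2_D, Q = I
eg <- eigen(t(X) %*% D %*% X, symmetric = TRUE)
a1 <- eg$vectors[, 1]                                 # first principal axis
rand <- replicate(500, { a <- rnorm(3); a / sqrt(sum(a^2)) })  # random unit vectors
stopifnot(crit(a1) + 1e-12 >= max(apply(rand, 2, crit)),
          isTRUE(all.equal(crit(a1), eg$values[1])))  # maximum equals lambda_1
```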
Note that principal cofactors and factors are not useful in the examples presented in this paper, but they can be of interest for some particular diagrams. For instance, in the case of canonical correlation analysis (cancor in package stats), which corresponds to a particular duality diagram, the principal factors and cofactors contain the coefficients used to construct the linear combinations of variables with maximal correlation.

Matrix approximation
We have shown that the analysis of the individuals and the analysis of the variables are closely related. In this section, we show how it is possible to represent the individuals, the variables and their relationships on the same plot. This joint representation is linked to the theory of matrix approximation and is considered the main output of multivariate methods.
We consider the product KΛ_[r]^{1/2}A'. Using the transition formulas defined above, we obtain KΛ_[r]^{1/2}A' = X. Left-multiplication by K'D leads to K'DX = Λ_[r]^{1/2}A'. The diagonalization of a duality diagram is thus closely linked to the singular value decomposition of X (SVD, Eckart and Young 1936). We denote by L = XQA the row scores (i.e., the projection of the rows of X onto the principal axes) and by C = X'DK the column scores (i.e., the projection of the columns of X onto the principal components). Using the transition formulas, we can show that L = KΛ_[r]^{1/2} and C = AΛ_[r]^{1/2}.
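For the centered PCA triplet, the reconstruction formula X = KΛ_[r]^{1/2}A' can be checked in base R (illustrative names; with r equal to the rank of the diagram the reconstruction is exact):

```r
## Rank-r reconstruction X = K Lambda^{1/2} A' for the triplet (X, I_p, (1/n) I_n).
set.seed(6)
X <- scale(matrix(rnorm(10 * 4), nrow = 10), scale = FALSE)  # centered table
n <- 10
eg <- eigen(t(X) %*% X / n, symmetric = TRUE)
r <- sum(eg$values > 1e-10)                      # rank of the diagram
A <- eg$vectors[, 1:r, drop = FALSE]             # principal axes
L <- X %*% A                                     # row scores L = XQA
K <- L %*% diag(1/sqrt(eg$values[1:r]))          # components, since L = K Lambda^{1/2}
Xhat <- K %*% diag(sqrt(eg$values[1:r])) %*% t(A)
stopifnot(isTRUE(all.equal(Xhat, X, check.attributes = FALSE)))
```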

Decomposition of inertia
Multivariate methods provide graphical representations which summarize the data table X. In order to estimate the quality of this representation, additional tools can be used to (1) measure the part played by a variable or an individual in the construction of the representation and (2) evaluate the quality of the representation of each variable and individual in this new subspace. These tools are derived from the notion of inertia.
If a duality diagram contains at least one matrix of weights (i.e., Q and/or D diagonal), it is possible to compute inertia statistics. The inertia of a cloud of points is the weighted sum of the squared distances between the points and the origin. If D is diagonal, we can compute the inertia of the cloud of row vectors (in R^p). The total inertia of the diagram is equal to I_T = Σ_{i=1}^{n} d_ii ||x_i||²_Q, where d_ii is the i-th diagonal element of D and x_i is the i-th row of the matrix X. The rows of X can be projected onto a Q-normalized vector a, and the projected inertia is then equal to ||XQa||²_D. From the properties of the diagram defined above, it appears that the diagonalization of the diagram consists in finding a set of Q-normalized vectors (the principal axes) which maximize the projected inertia. The inertia projected onto the principal axis a_k is equal to λ_k.
Two inertia statistics are commonly used to facilitate the interpretation of the results. The absolute contribution measures the part played by one point in the inertia projected onto one axis. The absolute contribution of the row x_i to the axis a_k is equal to d_ii (l_ik)² / λ_k, where l_ik is the i-th element of the row scores l_k = XQa_k. The relative contribution (or Cos²) quantifies the contribution of one axis to the inertia of a point. It measures the quality of the representation of one point by its projection onto one axis. The relative contribution of the axis a_k to the row x_i is equal to (l_ik)² / ||x_i||²_Q.
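Both statistics can be computed directly in base R for the centered PCA triplet (X, I_p, (1/n) I_n); names are illustrative. Absolute contributions sum to 1 over the rows of each axis, and relative contributions sum to 1 over the axes of each row:

```r
## Contributions for the centered PCA triplet (X, I_p, (1/n) I_n).
set.seed(7)
X <- scale(matrix(rnorm(10 * 3), nrow = 10), scale = FALSE)  # centered table
n <- 10
eg <- eigen(t(X) %*% X / n, symmetric = TRUE)
A <- eg$vectors                                  # principal axes (Q = I)
lambda <- eg$values
L <- X %*% A                                     # row scores
## absolute contribution of row i to axis k: d_ii * l_ik^2 / lambda_k
abs_contrib <- sweep((1/n) * L^2, 2, lambda, "/")
## relative contribution (Cos^2) of axis k to row i: l_ik^2 / ||x_i||^2
rel_contrib <- sweep(L^2, 1, rowSums(X^2), "/")
stopifnot(isTRUE(all.equal(colSums(abs_contrib), rep(1, 3))),
          isTRUE(all.equal(rowSums(rel_contrib), rep(1, n))))
```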

Implementation in R
The theoretical presentation considers that matrices Q and D are positive and symmetric. The duality diagram theory is more general and can also accommodate one non-positive matrix.
The implementation in the function as.dudi is more restrictive and considers only diagonal matrices for Q and D. However, as shown above, Cholesky decompositions of Q and D make it easy to obtain a diagram with two diagonal matrices. In this section, we show how the duality diagram theory is implemented in ade4. Usually, the user performs a particular analysis by a call to a 'dudi.*' function. This function contains a call to the as.dudi function and returns an object of the class dudi. The main steps of a 'dudi.*' function are the following:
1. The user calls a 'dudi.*' function with a data frame df as its first argument.
2. The consistency of the arguments is checked. For instance, df must contain only positive or null values for the dudi.coa function.
3. The three basic elements X, Q and D of the duality diagram are created. X is obtained by a possible modification of the argument df (e.g., centering and/or scaling of the variables for the function dudi.pca). Column (Q) and row (D) weights are computed and stored in two vectors. Note that for some methods, the user can supply custom weight vectors through the optional arguments col.w and row.w when available.
4. The function as.dudi is called with the three elements defined above as arguments. The duality diagram is diagonalized and a dudi object is returned to the environment of the 'dudi.*' function.
5. Depending on the method, some elements may be added to the dudi object.
6. The dudi object is returned to the environment from which the 'dudi.*' function was called.

The as.dudi function
The function as.dudi is the core of the implementation of the duality diagram in ade4. It has 9 arguments which are described in its help page:

R> args(as.dudi)
function (df, col.w, row.w, scannf, nf, call, type, tol = 1e-07, full = FALSE)
NULL

The first part of this function checks the consistency of the arguments:

if (!is.data.frame(df)) stop("data.frame expected")
lig <- nrow(df)
col <- ncol(df)
if (length(col.w) != col) stop("Non convenient col weights")
if (length(row.w) != lig) stop("Non convenient row weights")
if (any(col.w < 0)) stop("col weight < 0")
if (any(row.w < 0)) stop("row weight < 0")
if (full) scannf <- FALSE

Then, the diagonalization of the duality diagram is performed. In order to speed up this step, the function diagonalizes in the smaller dimension. If n > p (i.e., transpose = FALSE), the matrix Ω'Ω is diagonalized; if p > n (i.e., transpose = TRUE), ΩΩ' is diagonalized:

transpose <- FALSE
if (lig < col) transpose <- TRUE
res <- list(tab = df, cw = col.w, lw = row.w)
df <- as.matrix(df)
df.ori <- df
df <- df * sqrt(row.w)
df <- sweep(df, 2, sqrt(col.w), "*")

When the diagonalization is performed, the user has to choose the number of axes to keep (nf). If the argument scannf is set to its default value TRUE, the screeplot of eigenvalues is displayed in order to facilitate this choice and to reduce the risk of over- or underestimation of the number of axes to interpret.
Principal axes A are then obtained by A = E^{-1}V and verify A'QA = I_r. Row scores are computed as L = XQA. Column scores (C) and principal components (K) are then obtained by rescaling: C = AΛ^{1/2} and K = LΛ^{-1/2}. An object of the class dudi, described in the next section, is returned.
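A usage sketch, assuming the ade4 package is installed (the deug data set ships with ade4; argument values are illustrative):

```r
library(ade4)               # assumed to be installed
data(deug)                  # students' exam scores, shipped with ade4
dd <- dudi.pca(deug$tab, scannf = FALSE, nf = 2)  # normed PCA, keep 2 axes
names(dd)                   # components of the dudi object (tab, cw, lw, eig, ...)
sum(dd$eig)                 # total inertia of the diagram
```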

The dudi class
The object returned by the as.dudi function is a list of class dudi. This object stores all the elements related to the diagonalization of a duality diagram. It contains at least the following components:

• tab: a data.frame (n rows, p columns) with the modified table X
• lw: a vector (length n) of row weights (D)
• cw: a vector (length p) of column weights (Q)
• eig: a vector (length r) of eigenvalues (Λ)
• nf: the number of axes kept
• rank: the rank of the duality diagram (r)
• l1: a data.frame (n rows, nf columns) with the principal components (K)
• c1: a data.frame (p rows, nf columns) with the principal axes (A)
• li: a data.frame (n rows, nf columns) with the row scores (L)
• co: a data.frame (p rows, nf columns) with the column scores (C)
• call: the matched call

There are three methods for the dudi class:

R> methods(class = "dudi")
[1] print.dudi scatter.dudi t.dudi

The print.dudi function prints a dudi in a concise way. t.dudi transforms the dudi corresponding to the triplet (X, Q, D) into a new one corresponding to (X', D, Q). The function scatter.dudi provides a graphical representation of a dudi by a simultaneous representation of variables and individuals (i.e., a biplot, Gabriel 1971).
There are other functions related to the dudi class: is.dudi tests if an object is of class dudi, inertia.dudi returns inertia statistics, reconst computes the table approximation and redo.dudi recomputes an analysis with a new number of axes.

An example: dudi.hillsmith
In this section, we analyze the environmental information of the dune meadow data (Jongman, ter Braak, and Van Tongeren 1987). This data set is available in the ade4 package. Data on the environment and land-use have been sampled in 20 sites:

R> data("dunedata")
R> sapply(dunedata$envir, class)

The variables are:
• A1: thickness of the A1 horizon.
• moisture: moisture content of the soil.
• manure: quantity of manure applied.
• use: agricultural grassland use. This variable is coded as an ordered factor with levels hayfields < both < grazing.
The ordered variable use is transformed into a factor. To construct the table X, a quantitative variable is not modified, while a qualitative variable with m levels is coded by m dummy variables. If the numbers of levels of the factors are m_1, ..., m_pf, the resulting table X has p_q + m_1 + ... + m_pf columns, where p_q is the number of quantitative variables.
By default, D = (1/n) I_n. Column weights are computed and stored in Q. If the j-th column of X (denoted x_j) corresponds to a quantitative variable, then q_jj = 1. If x_j corresponds to a dummy variable coding the l-th level of the f-th factor, then q_jj = (x_j)'Dx_j = n_f(l)/n, where n_f(l) is the number of individuals belonging to the l-th level of the f-th factor.
Lastly, table X is modified. If x_j corresponds to a quantitative variable, it is scaled to mean (x_j)'D1_n = 0 and variance (x_j)'Dx_j = 1, where 1_n is the unit vector with n elements. If x_j corresponds to a dummy variable, it is transformed into (x_j − q_jj 1_n)/q_jj = (n/n_f(l)) x_j − 1_n. This analysis seeks principal axes a_j which maximize the quadratic form ||l_j||²_D = ||XQa_j||²_D under the orthogonality constraints (a_j)'Qa_j = 1 and (a_i)'Qa_j = 0 for i ≠ j. In other words, the analysis finds coefficients a_j yielding a linear combination of variables l_j = XQa_j which maximizes ||l_j||²_D = VAR(l_j) = λ_j. Simultaneously, the analysis seeks principal components k_j which maximize the quadratic form ||c_j||²_Q = ||X'Dk_j||²_Q under the orthogonality constraints (k_j)'Dk_j = 1 and (k_i)'Dk_j = 0 for i ≠ j.
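The coding described above can be sketched in base R (simulated data; one quantitative variable and one factor, all names illustrative):

```r
## Hill-Smith-style coding of a mixed table.
set.seed(8)
n <- 12
quant <- rnorm(n)                                   # one quantitative variable
fac <- factor(sample(c("a", "b", "c"), n, replace = TRUE))  # one factor
d <- rep(1/n, n)                                    # uniform row weights (D)
## quantitative column: D-mean 0 and D-variance 1 (1/n divisor)
xq <- (quant - mean(quant)) / sqrt(mean((quant - mean(quant))^2))
## dummy columns: column weight q_jj = n_l / n, then (n / n_l) x_j - 1
dummies <- model.matrix(~ fac - 1)
qw <- colMeans(dummies)                             # q_jj = n_l / n
xf <- sweep(dummies, 2, qw, "/") - 1
X <- cbind(xq, xf)
## every transformed column has D-mean 0
stopifnot(isTRUE(all.equal(colSums(X * d), rep(0, ncol(X)), check.attributes = FALSE)))
```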
If the i-th column of X corresponds to a quantitative variable, the quantity q_ii ((x_i)'Dk_j)² is equal to COR²(x_i, k_j).
If the i-th column of X corresponds to a dummy variable, the quantity q_ii ((x_i)'Dk_j)² is equal to (n_f(l)/n) · m²(k_j, f(l)), where m(k_j, f(l)) is the mean of k_j computed for the individuals of the l-th level of the factor f. Summing these quantities over the m_f dummy variables of the factor f leads to Σ_{l=1}^{m_f} (n_f(l)/n) · m²(k_j, f(l)) = η²(k_j, f) (i.e., a correlation ratio). In other words, the analysis finds principal components which maximize the sum of squared correlations (for quantitative variables) and correlation ratios (for qualitative variables).
Results of the analysis are summarized on the biplot produced by the function scatter. By default, the first two principal axes and the row scores are represented (Figure 3). Setting the argument permute to TRUE represents instead the principal components and the column scores.

R> scatter.dudi(dd1)
The first axis of the analysis discriminates sites with a high level of manure, which is related to standard farming (sites 1, 3, 4, 12, 13 and 16), from conserved sites (14, 15, 17, 18, 19, 20). The second axis separates sites with high moisture and a thick A1 horizon (14, 15) from dry sites managed as hobby or biological farming (2, 7, 10, 11). The analysis highlights the main environmental variations and provides a synthetic typology of the sites, which could be useful for conservation purposes. One could expect to relate this structure to variations in species richness or species composition (gradient analysis). Results obtained by this analysis can also be useful to improve the sampling protocol. For instance, they show that the information given by the thickness of the A1 horizon and the moisture content of the soil is quite redundant. To reduce the cost of future sampling sessions without losing important information, one could then choose to measure only one of these two variables.

Conclusions
This presentation focuses on the analysis of one table using a dudi.* function. The duality diagram theory is more general and several other methods can also be considered. These include, for instance, methods to take into account a partition of individuals such as discriminant analysis (discrimin) or within- and between-classes analyses (within and between, Dolédec and Chessel 1987), and methods to analyze a pair of tables such as co-inertia analysis (coinertia, Dolédec and Chessel 1994; Dray et al. 2003) or principal component analysis on instrumental variables (pcaiv, Rao 1964; Lebreton, Sabatier, Banco, and Bacou 1991), including redundancy analysis (van den Wollenberg 1977) and canonical correspondence analysis (cca, ter Braak 1986). All these methods are particular duality diagrams and their implementations in ade4 contain a call to the as.dudi function.
The duality diagram theory makes it easy to define and compare methods within a well-established mathematical framework. It also simplifies the implementation of methods, as the functions are always based on the same skeleton. The reader can consult Chessel et al. (2004) and Dray, Dufour, and Chessel (2007) for a more detailed description of the contents of the ade4 package. It should be noticed that ade4 also contains more than 100 data sets to illustrate the different methods, and that theoretical elements as well as ecological illustrations are presented in the pedagogical resources available to French readers at http://pbil.univ-lyon1.fr/R/enseignement.html.