Processing Ecological Data in R with the mefa Package

mefa is an R package for multivariate data handling in ecology and biogeography. It provides object classes to represent the data coded by samples, taxa and segments (i.e., subpopulations, repeated measures). It supports easy processing of the data along with relational data tables for samples and taxa. An object of class mefa is a project specific compendium of the dataset and can be easily used in further analyses. Methods are provided for extraction, aggregation, conversion, plotting, summary and reporting of mefa objects. Reports can be generated in plain text or LaTex. This paper presents worked examples on a variety of ecological analyses.


Introduction
Fortunately, many packages are available for the analysis of ecological data in R (R Development Core Team 2008), e.g., the ade4 (Thioulouse and Dray 2007), labdsv (Roberts 2007) and vegan (Oksanen et al. 2008) packages among other more standard statistical packages e.g., MASS (Venables and Ripley 2002) and stats (R Development Core Team 2008).These, however, often require the multivariate data in the form of a matrix or data frame.
Extensive ecological data sets (with observations of multiple taxa at multiple locations) are not stored in crosstabulated format because most biodiversity data sets are sparse with matrix fill often lower than 30 percent.The conversion between data formats of biodiversity databases and ecological packages might require substantial work.Of course there are many possibilities for general data manipulation in R (Spector 2008).The Hmisc (Harrell Jr 2008), reshape (Wickham 2007), simba (Jurasinski 2007) and labdsv (Roberts 2007) packages contain functions for converting ecological data in database formats into crosstabulated data matrices, and vice versa.
When the problems are more complex (e.g., a survey spanning across multiple spatial or temporal scales), results are stored in several related data tables.Most multivariate methods require a matrix, whereas response modeling usually require data frame as input.The simultaneous manipulation and checking of the community data matrix and related data frames can be time consuming with the standard tools.
The aim of the mefa R package is to provide standardized computational environment for specialist work in ecology and biogeography by bridging the gap between the data and the analysis, and reducing the time spent with data preprocessing.It provides object classes and methods for convenient manipulation of related data tables and can be used for generating reports in plain text or L A T E X format.
The package name mefa is a short for metafaunistics indicating that data processing is a critical step prior to data analysis.The faunistics part refers to the study of the fauna of some territory or area, while the meta part refers to the procedures (data processing and analysis) beyond the data collection part of the scientific endeavor.Of course, the package intends to be more general than covering only faunistics.It can be useful when dealing with almost all kinds of organisms that can be counted or measured, including microorganisms, fungi and plants as well.
Compared to previous versions (< 2.0; Sólymos 2008), the package has been extensively rewritten to enhance efficiency and speed, and with a focus on methods, but the data model remained the same.A stable version of the package is available under the terms of the General Public License from the Comprehensive R Archive Network (CRAN, http://CRAN.R-project.org/package=mefa),developmental version is available at the R-Forge (http: //mefa.R-Forge.R-project.org/).In this paper I outline the motivation and the general idea behind the package, and I describe the structure of the object classes.A real-world data set is used to demonstrate the methods in the package and the use of objects in further data analysis.

Example dataset
The "Dolina 2007" survey was aimed to discover the soil and litter dwelling macroinvertebrates (land snails, and terrestrial isopods) with special respect to microhabitat characteristics.We have surveyed dolines (sinkholes, karstic depressions) at the Alsó-hegy plateau of the Aggtelek National Park, North Hungary, August, 2007.These dolines were mainly covered by hornbeam and beech, and were 0.5-2 hectares in extent.The dolines contained keystone habitat elements i.e., large pieces of coarse woody debris and different rock formations.
We collected invertebrates from four different microhabitats (Litter, live Wood, Dead wood as an equivalent for coarse woody debris, and Rock1 ) with at least three replicates per microhabitat per doline.We collected 1 L of litter per replicate ("quadrat" method for short) and employed a time restricted search ("time" method, five minutes per replicate) in a 1 m radius around the litter sample location.Litter samples were taken from the bottom of live trees, dead trees and rocks (Kemencei et al. 2008;Vilisics et al. 2008).
Here I will use the data of land snails from the first dolina only, that is provided with the package (dataset dol.count).Samples are named after the microhabitat, the applied method and a number for replication.Snails were identified to species and categorized according to extent of shell decay.Four letter short names were used to refer to species.Live animals and fresh shells constituted the "fresh" group.Whitened, decayed and broken shells constituted the "broken" group.Shell decay stages formed the segments in the data set for further analyses.
A data frame containing microsite characteristics of the samples and the applied methods is provided in the dataset dol.samp.Characteristics (nomenclature, taxonomy, adult shell dimension) of the snail species are in the dataset dol.taxa and are based on (Kerney et al. 1983).The full dataset along with the code showing how it was extracted from the full dataset is available on the Dataverse Network (Sólymos and Kemencei 2008).

Motivation
Load the package and the example data set (showing only the first 16 rows out of 297) This is a typical format for biodiversity datasets, where the samp column represents the observational units (samples).In planned ecological field experiments and surveys these units are expected to be comparable in terms of sampling effort.However, when data came from unplanned field observations, samples are often not comparable but refer to an observation, or set of individuals collected from a given location, by a given person in a given time.
The taxa column refers to the taxonomic identity of the individuals found in a given sample.
The taxonomic resolution might vary due to expert knowledge and development stage of the individuals.When an actual sample did not contain individuals, it is convenient to refer to this situation by a pseudo-species (here, it is named as "zero.pseudo")indicating that 0 mefa: Processing Ecological Data in R individuals were found in the sample coded as LT1 (first replicate from the litter microhabitat collected by timed search).
The count column contains the outcome of the field experiment, the number of individuals of a given taxa (i.e., a species) that were found in a given sample.It is zero when the sample contained no individuals.This notation is only necessary, if we are interested in zero samples (i.e., indication of very low abundances).Functions in the package accept non-integer values, too.
The segm column is used to distinguish segments (subpopulations) within individuals of the same species (nested into samples and species).Here we use "fresh" and "broken" segments to classify individuals based on shell decay.Other common examples for such segments are when distinguishing between males and females, different life stages or age classes.But it can also be used to identify subsets of the data, e.g., in case of a repeated measures experiment, when samples are nested within subsequent sampling period.As an example, data from museum collections can be used in this way to describe data accumulation trends through time (Sólymos and Fehér 2008).
This format is ideal for the storage of the data but not adequate for data analysis.Prior to analysis, we have to crosstabulate the data to get a matrix filled with numeric values and with rows as samples and columns as taxa.Besides the functions table (in package base, for factors; R Development Core Team 2008), there are some other functions in R to do this.For two-way crosstabulation of such data, there is the function cast in the reshape package (Wickham 2007), the mama function in the simba package (Jurasinski 2007) based on the reshape function, and the matrify function in the labdsv package (Roberts 2007).
Complications may arise, however, when we are dealing with three-way crosstabulation of the data (see xtabs in package stats, with formula interface; R Development Core Team 2008) and complex data structures, like the results of a hierarchical sampling design.We have to aggregate the samples into higher level units or the taxa into taxonomic or functional groups.
And we might want to extract a subset of the crosstabulated data along with subsetted tables for samples and taxa at the same time.The way to do it with the mefa package is shown in the next sections.
The zero.count attribute refers to the presence or absence of empty samples in the data set.If empty samples are present, taxa and segment indices of the corresponding rows are set to the value stored in the zero.pseudoattribute.This can be set by the zero.pseudoargument in the function call.Count values of this pseudo species should be 0, the segment value is indifferent (but it can be set as the second element of a character vector supplied as the zero.pseudoargument).
The number of columns in the input data frame may vary from two to four.If two columns are provided, it is assumed that the first column contains sample, while the second taxa names.If three columns are provided, the first two are treated as sample and taxa names, while the third is treated as count if numeric, and segment if character or factor.If four columns are provided, those are assumed to be in the samples, taxa, counts, segments order.
The tables are subsetted and ordered according to the row and column names in the crosstabulation.If names do not match, an error message is produced.This can be the case e.g., if the input object for the mefa function is a matrix without row or column names.It is not necessary to have names, only if tables for samples and taxa are provided.
An object of class 'mefa' is a list of length five.The first element is the function call ($call), second is a matrix with the cross tabulated data ($xtab).These two are always present, the remaining three depend on the availability of data and the aims of the study.The third element contains a list of matrices for segments ($segm), the fourth and fifth contain data frames with data for samples ($samp) and taxa ($taxa), respectively.The general structure of an object of class 'mefa' follows a relational data model, where dimnames attributes of the main count data matrix, the segment data matrices and the samples/taxa data tables for the rows and columns are identical.The print method for the 'mefa' objects shows the names of these elements to help memorizing the notation.Figure 1 depicting the object structure can be called as: R> mefalogo()

Summary and graphical display
Dimensions and dimension names can be retrieved by the dim and dimnames methods: R> dim(m2) [1] 24 28 2
This kind of melting-refreezing can be useful, if e.g., segments were not defined in the original 'stcs' object, or the 'mefa' object was based on a matrix or data frame (without segments).In this way, we can create segmented representations based on groups of samples (e.g., spatial units, repeated measures) or groups of taxa (e.g., taxonomic or functional groups) to more easily explore their differences.
Here we used melting-refreezing to explore the effects of sampling methods (note, originally shell decay stages were used as segments, and not sampling methods).We can visualize the relationship by the image method.This method creates a grid of colored rectangles with colors corresponding to the values in count data table.Segments can be specified as well.The rows and the columns are ordered according to their sums, being the top-left corner with the highest values.Transformations (log, order-scaled values based on quantiles, presenceabsence) are also available to best represent the data in the plot.Figure 4 shows differences between the sampling methods compared to the main data matrix.

Extraction and aggregation
If we want to analyze only a subset of the data, the extraction of the 'mefa' objects can be done via square brackets and by indexing the samples, taxa and segments to extract.Indexing can be numeric, or can refer to dimnames: R> ex1 <-m2[1:20, 11:15, "fresh"] R> dim(ex1) [1] 20 5 1 R> dim(ex1$samp) [1] 20 2 R> dim(ex1$taxa) [1] 5 4 As a result, the count data and the samples/taxa tables were subsetted in the same step.If negative numeric values are given, the respective parts of the object are omitted, e.g., the first element in the second dimension can be excluded by m2[, -1].Unused factor levels in the $samp and $taxa tables can be dropped by the argument drop = TRUE: [1] "time" "quadrat" R> ex3 <-m2[m2$samp$method == "time", drop = TRUE] R> levels(ex3$samp$method) [1] "time" The data set can be easily aggregated if we want to calculate statistics, e.g., species richness, on different levels of the sampling hierarchy (like in additive diversity partitioning ;Lande 1996;Crist et al. 2003).Aggregation can be done by internal vectors, when the name of the variable refers to a column in the samples or taxa tables.If external vectors (that are not part of the 'mefa' object) are used, the current implementation require a class attribute (e.g., "factor") to be recognized as an external object.Now we create a factor with levels "large" and "small" for snail species with adult body size larger or equal to, or smaller than 5 millimeters, respectively3 .Then, aggregate the object according to the internal "microhab" (microhabitat type) variable for samples and this external body size factor for taxa: R> size.5 <-as.factor(is.na(m3$taxa$size)| m3$taxa$size < 5) R> levels(size.5)<-c("large", "small") R> m4 <-aggregate(m3, "microhab", size.From these results it is clear, that the proportion of size categories differs among sampling methods and microhabitats.We will further analyze this in the next chapters.
The methods provided for the object classes in the mefa package are reviewed in Table 1.

Writing reports
If the count data as the result of a field experiment are given, along with information about the sampling locations, making a report is easy with the report method.This writes a text file into the specified or the working directory.Contents of this text file can be further used e.g., by copy-pasting it into a word processor.But it is more straightforward to use the = TRUE argument to write a formatted L A T E X file and use the L A T E X function input to include the results into the main document.A sample document on how to use the function in a Sweave document (Leisch 2002) can be viewed by calling mefadocs("SampleReport").Now we write the dataset into a file after extracting the ten randomly selected species only, retaining all samples and both segments: R> report(m5, "report.tex",tex = TRUE, segment = TRUE, + taxa.name= 1, author.name= 2, drop.redundant= 1) The meaning of the arguments used is: use the m5 'mefa' object to write a report into the file "report.tex"by using L A T E X formatting.Use segments.Taxa names are in the first, author names are in the second column of the taxa table.We also want that implicitly all columns in the sample table should be used in the same order as before, to generate sample information (this can be specified by the samp.varargument not used here).Redundant levels of these sample information are dropped here (see help page ?report.mefa for further details).The result is shown in Figure 5 with appropriate L A T E X formatting.The formatting rules can be modified via the tex.control argument.
Vallonia costata (O.F. Muller, 1774) rock, quadrat (fresh: 2).The abundance of snails in the samples was positively associated with dead wood and rock microhabitats.The interaction of sampling method and microhabitat has proved to be significant, indicating that the collecting efficiency of snails varied among microhabitats.The number of snails collected by the quadrat method was higher for the litter and live wood microhabitats, while in the dead wood microhabitat, the timed search resulted in more snails.
We can take advantage of the segments in the analysis as well.We now investigate, whether the proportion of fresh shells (including living animals) differ among microhabitats and sampling methods.First, we make a two-column matrix with the number of fresh shells per samples and the total number of shells (fresh and broken) collected per samples.Then, assuming that decay status (fresh or broken) of a single shell follows a Bernoulli process, we use this matrix in a binomial GLM (logistic regression) with the number of trials equal to the total number of individuals per sample: R> prop.fr<-cbind(summary(m2[ , , "fresh"])$s.abu,summary(m2)$s.abu)R> mod.fr <-glm(prop.fr~.^2, data = m2$samp, family = binomial) R> summary(mod.fr)It is clear that the proportion of fresh shells is not constant among microhabitats and sampling methods.The proportion was lowest in the rock microhabitats, although the quadrat method resulted more fresh shells from the rock microhabitats than the timed search method (due to the significant microhabitat × method interaction).

Multivariate response modeling
We employ a non-parametric multivariate analysis of variance to assess the effects of covariates on community composition.The method is implemented in the adonis function (McArdle and Anderson 2001;Anderson 2001) of the vegan package (Oksanen et al. 2008).The method is based on a distance matrix calculated from the community data by the Bray-Curtis index of dissimilarity, by default.Thus, the input data matrix should not contain rows with zero sum.We have to remove those samples from the count data matrix and the samples table as well.This is most easily done by the as.mefa method with using the argument drop.zero= TRUE: R> library("vegan") R> m6 <-as.mefa(m2,drop.zero= TRUE) R> m6.ado <-adonis(m6$xtab ~.^2, data = m6$samp, permutations = 100) R> m6.ado Call: adonis(formula = m6$xtab ~.^2, data = m6$samp, permutations = 100) The results revealed that 18 % of the total variation in community composition was due to sampling method (see also Figure 4), and 24.6 % was explained by microhabitats.Their interaction was also significant, explaining 15.4 % of the total variation.But the unexplained variance remained relatively high (42 %).
We can visualize this relationship by constrained (canonical) correspondence analysis (CCA).For this, use the cca function from the vegan package (Oksanen et al. 2008).We base the analysis only on the matrix of the "fresh" segment (note that here we use it as data frame) and use the samples table for environmental constraints: R> m2.cca <-cca(m2$segm[["fresh"]] ~., data=m2$samp, + subset=rowSums(m2$segm[["fresh"]]) > 0) R> plot(m2.cca) Most of the species tend to be associated with dead wood and rock microsites, and samples separate well according to sampling method (Figure 6).

Analyzing multiple subsets
The modular structure of the 'mefa' objects enables easy processing of the data in loops.Now we use hierarchical cluster analysis based on multiple subsets of the data to assess the effects of shell decay stages and sampling methods on the community composition.We first make a list with four subsets of the Dolina dataset, referring to combinations of timed search and quadrat method, and fresh and broken shells.We also aggregate the samples over microhabitats: R> m.list <-list() R> n1 <-rep(c("time", "quadrat"), each = 2) R> n2 <-rep(c("fresh", "broken"), 2) R> n3 <-paste(n1, n2, sep=".")R> for (i in 1:4) m.list [[n3[i]]] <-+ aggregate(m2[m2$samp$method == n1[i], , n2[i]], "microhab") Then we do the clustering on each elements of the list object m.list using Euclidean distance and Ward's method: R> for (i in 1:4) { + tmp <-hclust(dist(m.list[[i]]$xtab),"ward") + plot(tmp, main = LETTERS[i], sub = names(m.list)[i],xlab = "") + } Figure 7 shows that the dendrograms for the fresh and broken segments of the quadrat method are congruent, while that for the timed search are not.For the fresh shells, the dead wood communities resembled the rock community.While for the broken shells, dead wood communities were more similar to live wood and litter communities than to the rock.

Conclusions
The mefa package provides standardized environment and a and convenient tool for ecologists and biogeographers working on biodiversity data sets in R. The object classes ('stcs' and 'mefa') and S3 methods provide a coherent framework for data preprocessing of structured data sets.Based on 'mefa' objects, reports can be generated in plain text or L A T E X format.It was presented how the objects can be directly used in further analyses for a variety of ecological problems.

Acknowledgments
Initial development of the mefa package was motivated by complex sampling designs of landscape scale invertebrate surveys, especially the "Dolina 2007" experiment.The author would like to thank to Zita Kemencei, Ferenc Vilisics, Elisabeth Hornung, Roland Farkas, Zoltán

Figure 1 :
Figure 1: Representation of the relational data model in an object of class 'mefa'.

Figure 2 :
Figure 2: Plotting an object of class 'mefa': barcharts for species richness (A), and ranked log abundances for species (B).

Figure 3 :
Figure 3: Boxplots to depict differences among segments in abundances for samples (A), frequency of occurrences for taxa (B).

Figure 4 :
Figure 4: Levelplot representation of an object of class 'mefa'.Intensity of red coloring refers to increasing log abundances in the cells of the main data matrix (A) and the segments for the time restricted search (B), and the quadrat (C) methods.

Figure 6 :
Figure 6: Compound results of the constrained correspondence analysis based on a 'mefa' object.

Figure 7 :
Figure 7: Hierarchical cluster analysis of multiple subsets of the Dolina dataset.Subsets are extracted to represent combinations of shell decay stages (A, C: fresh; B, D: broken) and sampling methods (A-B: timed search; C-D: quadrat method).
If we do not want to use segments, use the segment = FALSE argument: segments represent e.g., successive sampling periods (years, decades) the use of the nested = TRUE argument can be convenient to inspect the data accumulation over time.In this case, values in subsequent segments are added up and segment names indicate this also: R> mefa(x1, nested = TRUE) we also have a table containing variables related to the species (three factors, one numeric, note that size is based on average adult shell dimension and taken from the literature):

Table 1 :
Description of methods provided for the object classes.

table :
Figure 5: Report generated from a 'mefa' object by the report method.