Application of a self-organizing map to select representative species in multivariate analysis: A case study determining diatom distribution patterns across France
Introduction
Biological communities are commonly used as indicators of ecosystem quality. Community structures are determined by many environmental factors at different spatial and temporal scales (Stevenson, 1997, Snyder et al., 2002). Community data are composed of a large number of species collected at many sampling sites at different times. A commonly observed phenomenon in field surveys is that the vast majority of species occur at low abundance while only a few species are abundant. Preston's canonical log-normal distribution is the most widely accepted formalization of the relative commonness and rarity of species (Preston, 1962, Brown, 1981).
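As a rough illustration of this commonness and rarity pattern, the sketch below draws hypothetical species abundances from a log-normal distribution and tallies them into doubling abundance classes ("octaves"). The number of species, the distribution parameters and the simplified octave boundaries are assumptions chosen for illustration only; they are not taken from the diatom dataset.

```python
import random

rng = random.Random(1)

# Hypothetical community: 200 species with log-normally distributed abundances.
# The parameters (mu=2.0, sigma=1.5) are arbitrary illustration values.
abundances = [max(1, round(rng.lognormvariate(2.0, 1.5))) for _ in range(200)]

# Simplified octaves: abundance classes that double in width (1, 2-3, 4-7, ...).
# For an integer n >= 1, n.bit_length() - 1 equals floor(log2(n)).
octaves = {}
for n in abundances:
    k = n.bit_length() - 1
    octaves[k] = octaves.get(k, 0) + 1

# Print the number of species falling in each abundance class.
for k in sorted(octaves):
    print(f"{2 ** k:>6} to {2 ** (k + 1) - 1:<6} {octaves[k]} species")
```

Plotting the species counts per octave gives the kind of histogram to which a log-normal curve is usually fitted.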
In quantitative community analysis, abundant species are commonly used to interpret patterns of habitat disturbance or ecosystem degradation, whereas rare species are generally excluded from the analysis. Although rare species have a negligible effect on statistical results, they introduce noise and complicate data analysis. Once the noise is removed, the more important information is more likely to be detected (McCune et al., 2002). To deal with rare species in community ecology, several approaches (e.g., down weighting, overweighting and deleting species) are applied depending on researchers' interests (Mante et al., 1995, Mante et al., 1997, Cao et al., 2001, Fodor and Kamath, 2002). This is regarded as a preprocessing stage in data mining. As illustrated in Fig. 1, data mining consists of two main steps, data preprocessing and pattern recognition (Fodor and Kamath, 2002). Preprocessing is often time consuming, yet critical as a first step. For the data mining process to succeed, the features extracted from the data should be representative of the data and relevant to the issues for which the data were collected.
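A minimal sketch of one such preprocessing step, deleting species by their frequency of occurrence, is given below. The function name, the threshold and the toy abundance table are hypothetical and serve only to illustrate the idea.

```python
def drop_rare_species(abundance, species, min_occurrence):
    """Keep species present (abundance > 0) in at least min_occurrence samples.

    abundance: list of rows, one row per sample and one column per species.
    Returns the reduced table and the names of the retained species.
    """
    n_species = len(species)
    # Count the number of samples in which each species occurs.
    occurrence = [sum(1 for row in abundance if row[j] > 0)
                  for j in range(n_species)]
    keep = [j for j in range(n_species) if occurrence[j] >= min_occurrence]
    reduced = [[row[j] for j in keep] for row in abundance]
    return reduced, [species[j] for j in keep]


# Toy data (hypothetical): 4 samples x 4 species.
species = ["sp_a", "sp_b", "sp_c", "sp_d"]
abundance = [
    [12, 0, 3, 0],
    [8, 1, 0, 0],
    [15, 0, 4, 0],
    [9, 0, 2, 1],
]
reduced, kept = drop_rare_species(abundance, species, min_occurrence=2)
print(kept)  # species found in at least two samples
```

Raising `min_occurrence` removes more species, so the threshold trades a smaller dataset against the risk of discarding ecologically relevant species.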
In community ecology, ordination and classification techniques are commonly used to simplify the interpretation of a complex dataset. However, this purpose is defeated if there are a very large number of variables. A large number of variables in the analysis may be informative to investigators in the exploratory phase of the study, yet it is difficult to point out the major issues contained in the dataset if the ordination diagrams are cluttered by numerous variables (Palmer, 2005). Therefore, it is desirable to reduce the number of variables for multivariate analysis in many cases. However, it is impossible to reduce the number of variables without the risk of losing information. In order to remove variables, one should make sure that ecologically relevant information is retained as far as possible.
Deleting rare species could be a useful way of reducing the bulk of ecological datasets, and the noise they contain, without losing much information (McCune et al., 2002). The simplest way to delete rare species is to consider the frequency of species in samples (MJM Software Design, 2000), and then to carry out direct or indirect gradient analyses such as Principal Component Analysis, Correspondence Analysis, Detrended Correspondence Analysis or Canonical Correspondence Analysis. However, traditional multivariate analyses are generally based on linear principles (James and McCulloch, 1990) and cannot overcome two problems: biases due to the complexity and non-linearity of the datasets, and inherent correlations among variables (Lek et al., 1996, Brosse et al., 1999). The self-organizing map (SOM) (Kohonen, 1982), on the other hand, has been used as an alternative to traditional statistical methods to deal efficiently with datasets governed by complex, non-linear relationships (Lek et al., 1996, Lek and Guégan, 2000). The SOM, an unsupervised neural network, has been applied to various ecological data (Lek and Guégan, 1999, Lek and Guégan, 2000, Recknagel, 2003): evaluation of environmental variables (Park et al., 2003a, Céréghino et al., 2003), classification of communities (Chon et al., 1996, Park et al., 2003b, Tison et al., 2005), water quality assessments (Walley et al., 2000), and prediction of populations and communities (Céréghino et al., 2001, Obach et al., 2001). The SOM produces virtual communities in a low dimensional lattice through an unsupervised learning process. Each input component (i.e., species) can be visualized on the SOM map to show its contribution to the self-organization of the map (Park et al., 2003b). These component planes can be considered as slices of the SOM map and provide a powerful tool for analyzing community structure.
However, when many species are considered (several hundred or several thousand), it is difficult to compare the component planes of all species. An efficient method for selecting species for removal therefore becomes necessary.
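To make the idea of a SOM and its component planes concrete, here is a minimal sketch in plain Python. It uses a rectangular lattice, sequential training with a Gaussian neighbourhood, and linearly decaying learning rate and radius; all of these choices, the function names and the toy data are assumptions made for illustration and are not the configuration used in this study.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x (squared Euclidean)."""
    return min(range(len(weights)),
               key=lambda k: sum((wj - xj) ** 2 for wj, xj in zip(weights[k], x)))


def train_som(data, rows, cols, n_iter=2000, lr0=0.5, seed=0):
    """Train a small rectangular SOM by sequential (online) updates."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per map unit, initialised from randomly chosen samples.
    weights = [list(rng.choice(data)) for _ in range(rows * cols)]
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)               # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5   # shrinking neighbourhood radius
        x = rng.choice(data)
        br, bc = coords[best_matching_unit(weights, x)]
        for k, (r, c) in enumerate(coords):
            # Gaussian neighbourhood centred on the best matching unit.
            h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2.0 * sigma ** 2))
            w = weights[k]
            for j in range(dim):
                w[j] += lr * h * (x[j] - w[j])
    return weights


def component_plane(weights, rows, cols, j):
    """Values of the j-th input component (species) laid out on the map grid."""
    return [[weights[r * cols + c][j] for c in range(cols)] for r in range(rows)]


# Toy data (hypothetical): 6 samples x 3 species, two contrasting community types.
samples = [[10, 1, 0], [9, 2, 0], [11, 0, 1], [0, 1, 8], [1, 0, 9], [0, 2, 10]]
weights = train_som(samples, rows=2, cols=3)
plane = component_plane(weights, 2, 3, j=0)   # plane for the first species
for row in plane:
    print([round(v, 2) for v in row])
```

Each weight vector plays the role of a virtual community, and the component plane shows how the abundance of one species varies across the map. With hundreds of species there would be one such plane per species, which is what makes visual comparison impractical.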
In this study we propose a computational method to reduce the number of species in datasets with a large number of species without losing much information. The datasets with the reduced number of species were further evaluated in relation to environmental conditions. This approach can contribute to practical ecosystem management in handling huge datasets and would broaden the scope of SOM in mining community data in diverse quantitative ecological studies.
Ecological dataset
From the Cemagref French Diatom Database, 836 samples were extracted. The data had been collected nationwide throughout France (Fig. 2) in summer from 1979 to 2002 according to the NFT 90-354 recommendations (AFNOR, 2000). Diatom species were identified at a 1000× magnification (Leitz DMRD photomicroscope) according to Krammer and Lange-Bertalot (1986, 1988, 1991a, 1991b): examination of permanent slides of cleaned diatom frustules, having been digested in boiling H2O2 (30%) and HCl (35%), and
Patterning samples with a large dataset
Diatom communities consisting of 941 species were patterned through the learning process of the SOM (Fig. 5a). The size and grey level of each hexagon represent the number of samples assigned to that SOM unit, ranging from 2 (small, white) to 22 (large, black). The SOM units were further grouped into 11 clusters based on the dendrogram of a hierarchical cluster analysis (Fig. 5b).
The SOM weight vectors were used for the classification of the units. Overall diatom communities were well organized in the SOM map
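The grouping of SOM units from their weight vectors can be sketched as a small agglomerative clustering routine. The linkage criterion (average linkage), the Euclidean distance, the function names and the toy weight vectors below are assumptions for illustration; the text above does not specify which linkage or distance was used.

```python
import math


def euclid(a, b):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_units(weights, n_clusters):
    """Agglomerative clustering of SOM weight vectors (average linkage).

    Returns one cluster label per map unit.
    """
    clusters = [[i] for i in range(len(weights))]  # start with one cluster per unit
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                d = sum(euclid(weights[i], weights[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # merge the closest pair
        del clusters[b]
    labels = [0] * len(weights)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical weight vectors for six map units (three species each).
unit_weights = [
    [9.8, 1.1, 0.2], [9.5, 1.4, 0.6], [7.0, 1.0, 2.5],
    [2.0, 1.2, 7.1], [0.6, 0.9, 8.8], [0.3, 1.0, 9.4],
]
labels = cluster_units(unit_weights, n_clusters=2)
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

Stopping the merging at a chosen number of clusters corresponds to cutting the dendrogram at a given level; the toy example stops at two clusters, whereas the study reports 11.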
Discussion
Dimension reduction is required when data gathered through long-term or large-scale field surveys have a higher dimension than can be tolerated. The goal of dimension reduction is to find a simplified representation of the original data without losing much information. Dimension reduction can be considered in two categories: 1) reduction of the number of features representing a data item (from m items in the original data to n items in the reduced data, n < m) and 2) reduction of the number of basis vectors
Acknowledgements
This work was supported by the EU projects Rebecca (contract number SSPI-CT-2003-502158) and Euro-limpacs (contract number GOEC-CT-2003-505540).
References (35)
- et al. (1999). The use of artificial neural networks to assess fish abundance and spatial occupancy in the littoral zone of a mesotrophic lake. Ecol. Model.
- et al. (2001). Spatial analysis of stream invertebrates distribution in the Adour–Garonne drainage basin (France), using Kohonen self organizing maps. Ecol. Model.
- et al. (1996). Patternizing communities by using an artificial neural network. Ecol. Model.
- et al. (2002). Dimension reduction techniques and the classification of bent double galaxies. Comput. Stat. Data Anal.
- et al. (1999). Artificial neural networks as a tool in ecological modelling, an introduction. Ecol. Model.
- et al. (1996). Application of neural networks to modelling nonlinear relationships in ecology. Ecol. Model.
- et al. (2001). Modelling population dynamics of aquatic insects with artificial neural networks. Ecol. Model.
- et al. (2003). Applications of artificial neural networks for patterning and predicting aquatic insect species richness in running waters. Ecol. Model.
- et al. (2005). Typology of diatom communities and the influence of hydro-ecoregions: A study on the French hydrosystem scale. Wat. Res.
- NFT 90-354: Détermination de l'indice biologique diatomées (IBD) (2000).