Elsevier

Ecological Informatics

Volume 1, Issue 3, November 2006, Pages 247-257

Application of a self-organizing map to select representative species in multivariate analysis: A case study determining diatom distribution patterns across France

https://doi.org/10.1016/j.ecoinf.2006.03.005

Abstract

Ecological communities consist of a large number of species. Most species are rare or have low abundance, and only a few are abundant and/or frequent. In quantitative community analysis, abundant species are commonly used to interpret patterns of habitat disturbance or ecosystem degradation. Rare species cause many difficulties in quantitative analysis by introducing noise and bulking up datasets, a problem compounded by the difficulty of handling large datasets. In this study we propose a method to reduce the size of large datasets by selecting the most ecologically representative species using a self-organizing map (SOM) and a structuring index (SI). As an example, we used diatom community data comprising 941 species sampled at 836 sites throughout the French hydrosystem. Out of the 941 species, 353 were selected. The selected dataset was effectively classified according to the similarities of community assemblages in the SOM map. Compared to the SOM map generated with the original dataset, the community pattern obtained from the selected dataset gave a very similar representation of the ecological conditions of the sampling sites, displaying clear gradients of environmental factors between different clusters. Our results showed that this computational technique can be applied to preprocessing data in multivariate analysis. It could be useful for ecosystem assessment and management, helping to reduce both the list of species for identification and the size of datasets to be processed for diagnosing the ecological status of water courses.

Introduction

Biological communities are commonly used as indicators of ecosystem quality. Community structures are determined by many environmental factors at different spatial and temporal scales (Stevenson, 1997, Snyder et al., 2002). Community data are composed of a large number of species collected at many sampling sites at different times. A commonly observed phenomenon in field surveys is that the vast majority of species occur at low abundance while only a few species are abundant. Preston's canonical log-normal distribution is the most widely accepted formalization of the relative commonness and rarity of species (Preston, 1962, Brown, 1981).

In quantitative community analysis, abundant species are commonly used to interpret patterns of habitat disturbance or ecosystem degradation, whereas rare species are generally excluded from the analysis. Although rare species have a negligible effect on statistical results, they introduce noise and cause difficulties in data analyses. By removing noise, the more important information is more likely to be detected (McCune et al., 2002). To solve the problems of rare species in community ecology, several different approaches (i.e., down-weighting, overweighting and deleting species) are applied depending on researchers' interests (Mante et al., 1995, Mante et al., 1997, Cao et al., 2001, Fodor and Kamath, 2002). This is regarded as a preprocessing stage in data mining. As illustrated in Fig. 1, data mining consists of two main steps, data preprocessing and pattern recognition (Fodor and Kamath, 2002). Preprocessing is often time consuming, yet critical as a first step. To ensure the success of the data mining process, the features extracted from the data should be representative of the data and relevant to the issues for which the data were collected.
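As an illustration of the deletion approach in such a preprocessing step, the following minimal sketch drops species that occur in fewer than a chosen number of samples. The function name, the `min_sites` cutoff and the toy abundance values are assumptions made for illustration only; they are not taken from the study.

```python
def drop_rare_species(abundance, species, min_sites=2):
    """Keep species present (abundance > 0) in at least min_sites samples.

    abundance: list of rows, one row per sample, one column per species.
    species:   list of column names.
    min_sites: hypothetical occurrence cutoff (an assumption, not from the study).
    """
    n_species = len(species)
    # Count the number of samples in which each species occurs.
    occurrences = [
        sum(1 for row in abundance if row[j] > 0) for j in range(n_species)
    ]
    keep = [j for j in range(n_species) if occurrences[j] >= min_sites]
    reduced = [[row[j] for j in keep] for row in abundance]
    return reduced, [species[j] for j in keep]


# Toy data: 4 samples x 4 species (made-up values).
samples = [
    [12, 0, 3, 0],
    [8, 1, 0, 0],
    [15, 0, 4, 0],
    [9, 0, 5, 2],
]
names = ["sp_a", "sp_b", "sp_c", "sp_d"]
reduced, kept = drop_rare_species(samples, names, min_sites=2)
print(kept)  # → ['sp_a', 'sp_c']
```

A frequency cutoff of this kind is simple to apply, but it ignores how strongly a species contributes to the community pattern.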

In community ecology, ordination and classification techniques are commonly used to simplify the interpretation of a complex dataset. However, this purpose is defeated if there are a very large number of variables. A large number of variables in the analysis may be informative to investigators in the exploratory phase of the study, yet it is difficult to point out the major issues contained in the dataset if the ordination diagrams are cluttered by numerous variables (Palmer, 2005). Therefore, it is desirable to reduce the number of variables for multivariate analysis in many cases. However, it is impossible to reduce the number of variables without the risk of losing information. In order to remove variables, one should make sure that ecologically relevant information is retained as far as possible.

Deleting rare species could be a useful way of reducing the bulk of ecological datasets and the noise they generate without losing much information (McCune et al., 2002). The simplest way to delete rare species is to consider the frequency of species in samples (MJM Software Design, 2000), and then to carry out direct or indirect gradient analyses such as Principal Component Analysis, Correspondence Analysis, Detrended Correspondence Analysis or Canonical Correspondence Analysis. However, traditional multivariate analyses are generally based on linear principles (James and McCulloch, 1990) and cannot overcome various problems: biases due to the complexity and non-linearity residing in datasets, and inherent correlations among variables (Lek et al., 1996, Brosse et al., 1999). The self-organizing map (SOM) (Kohonen, 1982), on the other hand, has been used as an alternative to traditional statistical methods to deal efficiently with datasets ruled by complex, non-linear relationships (Lek et al., 1996, Lek and Guégan, 2000). The SOM, an unsupervised neural network, has been implemented to analyze various ecological data (Lek and Guégan, 1999, Lek and Guégan, 2000, Recknagel, 2003): evaluation of environmental variables (Park et al., 2003a, Céréghino et al., 2003), classification of communities (Chon et al., 1996, Park et al., 2003b, Tison et al., 2005), water quality assessments (Walley et al., 2000), and prediction of populations and communities (Céréghino et al., 2001, Obach et al., 2001). The SOM produces virtual communities in a low-dimensional lattice through an unsupervised learning process. Input components (i.e., species) can be visualized on a SOM map to show the contribution of each component to the self-organization of the map (Park et al., 2003b). These component planes can be considered a sliced version of the SOM map and provide a powerful tool for analyzing community structure. However, when a large number of species is considered (e.g., several hundred or several thousand), it is difficult to compare the component planes of all species. It therefore becomes necessary to develop an efficient method to select species for removal.
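To make the notions of SOM learning and component planes concrete, a minimal sketch is given here. It assumes a small rectangular lattice, a Gaussian neighbourhood, and a linearly decaying learning rate and radius; these choices, the function names and the toy data are assumptions for illustration and do not reproduce the configuration used in the study.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest (Euclidean) to x."""
    return min(
        range(len(weights)),
        key=lambda u: sum((wk - xk) ** 2 for wk, xk in zip(weights[u], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small SOM; returns weights[unit][feature], units in row-major order."""
    rng = random.Random(seed)
    n_features = len(data[0])
    # Initialise each unit's weight vector with random values in [0, 1).
    weights = [[rng.random() for _ in range(n_features)]
               for _ in range(rows * cols)]
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    radius0 = max(rows, cols) / 2.0
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x = data[i]
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                     # decaying learning rate
            radius = max(radius0 * (1.0 - frac), 0.5)   # shrinking neighbourhood
            br, bc = coords[best_matching_unit(weights, x)]
            for u, w in enumerate(weights):
                d2 = (coords[u][0] - br) ** 2 + (coords[u][1] - bc) ** 2
                h = math.exp(-d2 / (2.0 * radius ** 2))
                # Move the unit towards the sample, scaled by lattice proximity.
                for k in range(n_features):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


def component_plane(weights, feature, rows, cols):
    """Slice one input component (e.g. one species) out of the map as a grid."""
    return [[weights[r * cols + c][feature] for c in range(cols)]
            for r in range(rows)]


# Toy data: 4 samples x 3 species, scaled to [0, 1] (made-up values).
data = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3], [0.0, 1.0, 0.2], [0.1, 0.9, 0.1]]
weights = train_som(data, rows=2, cols=2)
plane = component_plane(weights, feature=0, rows=2, cols=2)
for row in plane:
    print([round(v, 2) for v in row])
```

Each component plane is a grid of one species' weights across all map units, so with hundreds of species there are hundreds of such grids to inspect by eye.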

In this study we propose a computational method to reduce the number of species in datasets with a large number of species without losing much information. The datasets with the reduced number of species were further evaluated in relation to environmental conditions. This approach can contribute to practical ecosystem management in handling huge datasets and would broaden the scope of SOM in mining community data in diverse quantitative ecological studies.

Ecological dataset

From the Cemagref French Diatom Database, 836 samples were extracted. The data had been collected nationwide throughout France (Fig. 2) in summer from 1979 to 2002 according to the NFT 90-354 recommendations (AFNOR, 2000). Diatom species were identified at 1000× magnification (Leitz DMRD photomicroscope) according to Krammer and Lange-Bertalot (1986, 1988, 1991a, 1991b): examination of permanent slides of cleaned diatom frustules, having been digested in boiling H₂O₂ (30%) and HCl (35%), and

Patterning samples with a large dataset

Diatom communities consisting of 941 species were patterned through the learning process of the SOM (Fig. 5a). Grey-scale hexagons represent the number of samples assigned to each SOM unit, ranging from 2 (small, white) to 22 (large, black). The SOM units were further grouped into 11 clusters based on the dendrogram of a hierarchical cluster analysis (Fig. 5b).
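The grouping of SOM units by hierarchical cluster analysis can be sketched as follows. The linkage rule and distance measure are not stated in this excerpt, so the sketch assumes average linkage with Euclidean distance; the function names and the toy weight vectors are likewise hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per vector."""
    # Start with every vector (SOM unit) in its own cluster.
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                d = sum(euclidean(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair; b > a, so deleting b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy weight vectors for six SOM units (made-up values).
unit_weights = [
    [0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
    [0.1, 0.9], [0.2, 0.8], [0.15, 0.95],
]
labels = agglomerate(unit_weights, n_clusters=2)
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

Cutting the merge sequence at a chosen number of clusters corresponds to cutting the dendrogram at a given height.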

The SOM weight vectors were used for the classification of the units. Overall diatom communities were well organized in the SOM map

Discussion

Dimension reduction is required when data collected through long-term or large-scale field surveys are of a higher dimension than can be tolerated. The goal of dimension reduction is to find a simplified representation of the original data without losing much information. Dimension reduction can be considered in two categories: 1) reduction of the number of features representing a data item (from m items in the original data to n items in the reduced data, n < m) and 2) reduction of the number of basis vectors

Acknowledgements

This work was supported by the EU projects Rebecca (contract number SSPI-CT-2003-502158) and Euro-limpacs (contract number GOEC-CT-2003-505540).

References (35)

  • R. Bellman. Adaptive Control Processes: A Guided Tour (1961)
  • J.H. Brown. Two decades of homage to Santa Rosalia: toward a general theory of diversity. Am. Zool. (1981)
  • Y. Cao et al. Rare species in multivariate analysis for bioassessment: some considerations. J. N. Am. Benthol. Soc. (2001)
  • M.A. Carreira-Perpinan. Continuous latent variable models for dimensionality reduction and sequential data... (2001)
  • R. Céréghino et al. Predicting the species richness of aquatic insects in streams using a restricted number of environmental variables. J. N. Am. Benthol. Soc. (2003)
  • European Parliament. Directive 2000/60/EC of the European Parliament and of the Council establishing a framework for community action in the field of water policy. Off. J. L (2000)
  • F.C. James et al. Multivariate analysis in ecology and systematics: panacea or Pandora's box? Ann. Rev. Ecolog. Syst. (1990)